Calling two forward modules asynchronously in parallel in PyTorch - parallel-processing

I have to compute the output of two modules in parallel. There is no dependency between the two processing. Is it possible in PyTorch? I only need to do this during eval without batch processing.
My current code looks like this (input1.shape[0] = 1 and input2.shape[0] = 1)
x = self.layer1(input1)
y = self.layer2(input2)
I am wondering if something like this is feasible -
x, y = call_in_parallel(self.layer1(input1), self.layer2(input2))
Note that I do not need this functionality for training the model. I only need it during the evaluation phase during which I process the inputs in a batch size of 1.
Plus, in my case, I do not have simple MLP layers. I will like to call separate forwards on more complex models like GNNs or transformers.

Related

What is good way of using multiprocessing for bifacial_radiance simulations?

For a university project I am using bifacial_radiance v0.4.0 to run simulations of approx. 270 000 rows of data in an EWP file.
I have set up a scene with some panels in a module following a tutorial on the bifacial_radiance GitHub page.
I am running the python script for this on a high power computer with 64 cores. Since python natively only uses 1 processor I want to use multiprocessing, which is currently working. However it does not seem very fast, even when starting 64 processes it uses roughly 10 % of the CPU's capacity (according to the task manager).
The script will first create the scene with panels.
Then it will look at a result file (where I store results as csv), and compare it to the contents of the radObj.metdata object. Both metdata and my result file use dates, so all dates which exist in the metdata file but not in the result file are stored in a queue object from the multiprocessing package. I also initialize a result queue.
I want to send a lot of the work to other processors.
To do this I have written two function:
A file writer function which every 10 seconds gets all items from the result queue and writes them to the result file. This function is running in a single multiprocessing.Process process like so:
fileWriteProcess = Process(target=fileWriter,args=(resultQueue,resultFileName)).start()
A ray trace function with a unique ID which does the following:
Get an index ìdx from the index queue (described above)
Use this index in radObj.gendaylit(idx)
Create the octfile. For this I have modified the name which the octfile is saved with to use a prefix which is the name of the process. This is to avoid all the processes using the same octfile on the SSD. octfile = radObj.makeOct(prefix=name)
Run an analysis analysis = bifacial_radiance.AnalysisObj(octfile,radObj.basename)
frontscan, backscan = analysis.moduleAnalysis(scene)
frontDict, backDict = analysis.analysis(octfile, radObj.basename, frontscan, backscan)
Read the desired results from resultDict and put them in the resultQueue as a single line of comma-separated values.
This all works. The processes are running after being created in a for loop.
This speeds up the whole simulation process quite a bit (10 days down to 1½ day), but as said earlier the CPU is running at around 10 % capacity and the GPU is running around 25 % capacity. The computer has 512 GB ram which is not an issue. The only communication with the processes is through the resultQueue and indexQueue, which should not bottleneck the program. I can see that it is not synchronizing as the results are written slightly unsorted while the input EPW file is sorted.
My question is if there is a better way to do this, which might make it run faster? I can see in the source code that a boolean "hpc" is used to initiate some of the classes, and a comment in the code mentions that it is for multiprocessing, but I can't find any information about it elsewhere.

Choosing a better parallel architecture in Python

I am working on Data Wrangling problem using Python,
which processes a dirty Excel file into a clean Excel file
I would like to process multiple input files by introducing concurrency/parallelism.
I have the following options 1) Using multiThreading 2) Using multiProceesing modules 3) ParallelPython module,
I have a basic idea of the three methods, I would like to know which method is best and why?
In Bref, Processing of a SINGLE dirty Excel file today takes 3 minutes,
Objective : To introduce parallelism/concurrency to process multiple files at once.
Looking for, best method of parallelism to achieve the objective
Since your process is mostly CPU bound multi-threading won't be fast because of the GIL...
I would recommend multiprocessing or concurrent.futures since they are a bit simpler the ParallelPython (only a bit :) )
example:
with concurrent.futures.ProcessPoolExecutor() as executor:
for file_path, clean_file in zip(files, executor.map(data_wrangler, files)):
print('%s is now clean!' % (file_path))
#do something with clean_file if you want
Only if you need to distribute the load between servers then I would recommend ParallelPython .

Julia parallel programming - Making existing function available to all workers

I am faced with the following problem:
I have a function called TrainModel that runs for a very long time on a single thread. When it finishes computing, it returns a function as an output argument, let's call it f. When I enquire the type of this f, Julia returns:
(generic function with 1 method)
(I am not sure of this last piece of information is useful to anyone reading this)
Now in a second step, I need to apply function f on a very large array of values. This is a step that I would like to parallelise. Having had started Julia with multiple processes, e.g.
julia -p 4
ideally, I would use:
pmap(f, my_values)
or perhaps:
aux = #parallel (hcat) for ii=1:100000000
f(my_values[ii])
end
Unfortunately, this doesn't work. Julia complains that the workers are not aware of the function f, i.e. I get a messsage:
ERROR: function f not defined on process 2
How can I make function f available to all workers? Obviously a "dirty" solution would be to run the time-consuming function TrainModel on all workers, like this perhaps:
#everywhere f = TrainModel( ... )
but this would be a waste of cpu when all I want is that just the result f is available to all workers.
Though I searched for posts with similar problems, so far I could not find an answer...
Thanks in advance!
best,
N.
The approach to return the function seems elegant but unfortunately, unlike JavaScript, Julia does not resolve all the variables when creating the functions.
Technically, your training function could produce the source code of the function with literal values for all the trained parameters. Then pass it to each of the worker processes, which can parse it in their environment to a callable function.
I suggest to return a data structure that contains all the information to produce the trained function: weights of an ANN, support vectors, decision rules ...
Define a the "trained" function on the worker processes, such that it will utilized the trained parameters. You might want to have the ability of saving the results of the training to disk anyway, so that you can easily re-produce your computations.
There is a Unix-only solution based on the PTools.jl package (https://github.com/amitmurthy/PTools.jl).
It relies on parallelism via forking instead of the Julia in-built mechanism. Forked processes are spawned with the same workspace as the main process, so all functions and variables are directly available to the workers.
This is a similar to the Fork clusters in R parallel package, so it can be used as the mclapply function.
The function of interest is pfork(n::Integer, f::Function, args...) and one noticeable difference with mclapply in R is that the function f must take as first argument the index of the worker.
An example:
Pkg.add("PTools")
Pkg.checkout("PTools") #to get the last version, else the package does not build at the time of writing
using PTools
f(workid,x) = x[workid] + 1
pfork(3, f, [1,2,3,4,5]) #Only the three first elements of the array will be computed
3-element Array{Any,1}:
2
3
4
I expect that an interface to pfork will be built so that the first argument of the function will not need to be the index of the worker, but for the time being it can be used to solve the problem

Parallel processing in condor

I have a java program that will process 800 images.
I decided to use Condor as a platform for distributed computing, aiming that I can divide those images onto available nodes -> get processed -> combined the results back to me.
Say I have 4 nodes. I want to divide the processing to be 200 images on each node and combine the end result back to me.
I have tried executing it normally by submitting it as java program and stating the requirements = Machine == .. (stating all nodes). But it doesn't seem to work.
How can I divide the processing and execute it in parallel?
HTCondor can definitely help you but you might need to do a little bit of work yourself :-)
There are two possible approaches that come to mind: job arrays and DAG applications.
Job arrays: as you can see from example 5 on the HTCondor Quick Start Guide, you can use the queue command to submit more than 1 job. For instance, queue 800 at the bottom of your job file would submit 800 jobs to your HTCondor pool.
What people do in this case is organize the data to process using a filename convention and exploit that convention in the job file. For instance you could rename your images as img_0.jpg, img_1.jpg, ... img_799.jpg (possibly using symlinks rather than renaming the actual files) and then use a job file along these lines:
Executable = /path/to/my/script
Arguments = /path/to/data/dir/img_$(Process)
Queue 800
When the 800 jobs run, $(Process) gets automatically assigned the value of the corresponding process ID (i.e. a integer going from 0 to 799). Which means that your code will pick up the correct image to process.
DAG: Another approach is to organize your processing in a simple DAG. In this case you could have a pre-processing script (SCRIPT PRE entry in your DAG file) organizing your input data (possibly creating symlinks named appropriately). The real job would be just like the example above.

Design to distribute work when generating task oriented input for legacy dos application?

I'm attempting to automate a really old dos application. I've decided the best way to do this is via input redirection. The legacy app (menu driven) has many tasks within tasks with branching logic. In order to easily understand and reuse the input for these tasks, I'd like to break them into bit size pieces. Since I'll need to start a fresh app on each run, repeating a context to consume a bit might be messy.
I'd like to create an object model that:
allows me to concentrate on the task at hand
allows me to reuse common tasks from different start points
prevents me from calling a task from the wrong start point
To be more explicit, given I have the following task hierarchy:
START
A
A1
A1a
A1b
A2
A2a
B
B1
B1a
I'd like an object model that lets me generate an input file for task "A1b" buy using building blocks like:
START -> do_A, do_A1, do_A1b
but prevents me from:
START -> do_A1 // because I'm assuming a different call chain from above
This will help me write "do_A1b" because I can always assume the same starting context and will simplify writing "do_A1a" because it has THE SAME starting context. What patterns will help me out here? I'm using ruby at the moment so if dynamic language features can help, I'm game.
EDIT: after re-reading your question, I realized I misunderstood it. Let me answer what you actually asked...
I would create a hierarchy of classes. The simplest ones would be have functions like "do task A1b" that would output the appropriate steps to accomplish this. On top of that, I would build functions that would call the sub-tasks in specific orders to accomplish specific goals.
Pretending VIM was the program being controlled, the first level tasks would be things like 'Enter insert mode' 'Enter command mode' 'write the file' or 'input this arbitrary set of inputs'. On top of this I would build functions like 'insert "foobar" into the open file at the start of line 5' which would call the lower-level tasks.

Resources