Parallelization of torch.stack inside forward pass - parallel-processing

Inside the forward pass of a neural network I have this snippet of code:
myNewTensor = th.stack([
    self.my_function(inputA[i], inputB[i]) for i in range(inputA.shape[0])
])
This is obviously embarrassingly parallel, but there is no way to vectorize the function itself. Is there a way to parallelize this operation in PyTorch while still being able to backpropagate through it?
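One direction worth sketching (this is not from the original question, just a hedged suggestion): torch.jit.fork and torch.jit.wait schedule each per-sample call as an asynchronous task and, as far as I know, keep the autograd graph intact, so you can still backpropagate through the stacked result. Real parallel execution only kicks in when the code is compiled with TorchScript; in plain eager mode the forked calls fall back to sequential execution. A minimal sketch, where my_function and forward_stacked are hypothetical stand-ins for the code above:
import torch as th
from typing import List

# Hypothetical stand-in for the per-sample function in the question.
def my_function(a: th.Tensor, b: th.Tensor) -> th.Tensor:
    return (a * b).sum()

@th.jit.script
def forward_stacked(inputA: th.Tensor, inputB: th.Tensor) -> th.Tensor:
    # Launch one task per sample; fork returns a Future immediately.
    futures: List[th.jit.Future[th.Tensor]] = []
    for i in range(inputA.shape[0]):
        futures.append(th.jit.fork(my_function, inputA[i], inputB[i]))
    # wait() collects the results; the operations stay on the autograd tape.
    results: List[th.Tensor] = []
    for fut in futures:
        results.append(th.jit.wait(fut))
    return th.stack(results)

out = forward_stacked(th.randn(8, 3, requires_grad=True), th.randn(8, 3))
out.sum().backward()  # gradients flow back through the forked calls
Whether this actually helps depends on how heavy each my_function call is; for tiny per-sample workloads the task overhead can easily outweigh the gain, so it is worth benchmarking against the plain list comprehension.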

Related

For loops run SUPER slow in Julia when outside a function

There is a very peculiar slowdown in Julia. When running a for loop by calling a function, for example
function TestFunc(num)
for i=1:num
end
end
It is MUCH faster than when I just run a for loop over the exact same num directly in the global scope ...
for i=1:num
end
The slowdown isn't marginal either; it is orders of magnitude slower, as the timings in the attached screenshot ("For Loop Code") show.
In some of my other code the opposite actually happens, but I just feel like I am missing something fundamental about the way Julia runs. How do I keep my code optimal, and why do these differences exist?
Anything you can write outside a function, you can write inside a function. So just like in C, you can write
function main()
print("Hello World\n")
end
main()
So just pretend it is a C program and write your stuff inside the main() function.
Why is it so slow outside a function? Because a variable inside a function is protected from being changed by another thread or task, whereas a for loop in the global scope must check the type of its variables every time they are accessed, just in case another thread or task changed them. All this checking slows it down FOR SAFETY.
The first Performance Law of Julia is
Global is slow
The performance tips in the Julia documentation say:
A global variable might have its value, and therefore its type, change at any point. This makes it difficult for the compiler to optimize code using global variables. Variables should be local, or passed as arguments to functions, whenever possible.
Any code that is performance critical or being benchmarked should be inside a function.

How to Synchronize with Julia CUDArt?

I'm just starting to use Julia's CUDArt package to manage GPU computing. I am wondering how to ensure that if I go to pull data from the GPU (e.g. using to_host()) I don't do so before all of the necessary computations have been performed on it.
Through some experimentation, it seems that to_host(CudaArray) will lag while the particular CudaArray is being updated. So, perhaps just using this is enough to ensure safety? But it seems a bit chancy.
Right now, I am using the launch() function to run my kernels, as depicted in the package documentation.
The CUDArt documentation gives an example using Julia's @sync macro, which seems like it could be lovely. But for the purposes of @sync I am done with my "work" and ready to move on as soon as the kernel gets launched with launch(), not once it finishes. As far as I understand the operation of launch(), there isn't a way to change this behavior (e.g. to make it wait to receive the output of the function it "launches").
How can I accomplish such synchronization?
Ok, so, there isn't a ton of documentation on the CUDArt package, but I looked at the source code and I think it's straightforward to do this. In particular, it appears that there is a device_synchronize() function that will block until all of the work on the currently active device has finished. Thus, the following in particular seems to work:
using CUDArt
md = CuModule("/path/to/module.ptx", false)
MyFunc = CuFunction(md, "MyFunc")
GridDim = 2*2496
BlockDim = 64
launch(MyFunc, GridDim, BlockDim, (arg1, arg2, ...));  # launch returns immediately
device_synchronize()  # block until all work on the active device has finished
res = to_host(arg2)   # now safe to copy the result back
I'd love to hear from anyone with more expertise though if there is anything more to be aware of here.
I think the more canonical way is to make a stream for each device:
streams = [(device(dev); Stream()) for dev in devlist]
and then inside the @async block, after you tell it to do the computations, you use the wait(stream) function to tell it to wait for that stream to finish its computations. See the Streams example in the README.

Does using global variables impact performance in MATLAB?

As I understand it, MATLAB does not use pass by reference when sending arguments to other functions. I am doing audio processing, and I frequently have to pass waveforms as arguments into functions; because MATLAB uses pass by value for these arguments, it really eats up a lot of RAM when I do this.
I was considering using global variables as a method to pass my waveforms into functions, but everywhere I read there seems to be a general opinion that this is a bad idea, both for code organization and potentially for performance... but I haven't really read any detailed answers on how this might impact performance...
My question: What are the negative impacts of using global variables (with sizes > 100MB) to pass arguments to other functions in MATLAB, both in terms of 1) performance and 2) general code organization and good practice.
EDIT: From @Justin's answer below, it turns out MATLAB does on occasion use pass by reference when you do not modify the argument within the function! From this, I have a second related question about global variable performance:
Will using global variables be any slower than using pass by reference arguments to functions?
MATLAB does use pass by reference, but it also uses copy-on-write. That is to say, your variable will be passed by reference into the function (and so won't double up on RAM), but if you change the variable within the function, MATLAB will create a copy and change the copy (leaving the original unaffected).
This fact doesn't seem to be too well known, but there's a good post on Loren's blog discussing it.
Bottom line: it sounds like you don't need to use global variables at all (which are a bad idea as @Adriaan says).
While relying on copy-on-write as Justin suggested is typically the best choice, you can easily implement pass by reference. With MATLAB OOP being nearly as fast as traditional functions in MATLAB R2015b or newer, using a handle class is a reasonable option.
I encountered an interesting use case of a global variable yesterday. I tried to parallelise a piece of code (1200 lines, multiple functions inside the main function, not written by me) using parfor.
Some weird errors came out and it turned out that this piece of code wrote to a log file, but used multiple functions to write to the log file. Rather than opening and closing the relevant log file every time a function wanted to write to it, which is very slow, the file ID was made global, so that all write-functions could access it.
For the serial case this made perfect sense, but when trying to parallelise it, using global apparently breaks the scope of a worker instance as well. So suddenly we had 4 workers all trying to write to the same log file, which resulted in some weird errors.
So all in all, I maintain my position that using global variables is generally a bad idea, although I can see its use in specific cases, provided you know what you're doing.
Using global variables in MATLAB may increase performance a lot, because you can avoid copying data in some cases.
Before attempting to gain such performance tweaks, think carefully about the cost to your project in terms of the many drawbacks that global variables come with. There are also pitfalls to using globals with bad consequences for performance, and those may be difficult (though possible) to avoid. Any code that is littered with globals tends to be difficult to comprehend.
If you want to see globals in use for performance, you can look at this real-time toolbox for optical flow that I made. It is the only project in native MATLAB that I know of which is capable of real-time optical flow. Using globals was one of the reasons this was doable. It is also a reason why the code is quite difficult to grasp: globals are evil.
That globals can be used this way is not an argument for their use; rather, it should be a hint that something should be updated in MATLAB's inflexible notion of workspaces and its inefficient alternatives to globals such as guidata/getappdata/setappdata.

How to realize nested parallelism in R on the Windows platform

I am trying to use parApply() from the parallel package in R.
cl <- makeCluster(16)
cl.boot <- makeCluster(8)
In my program, I first call t(parApply(cl, rv, 1, sim.one.test)). Inside the function sim.one.test, I call a function boot(), and in boot() I use
bs.resample <- t(parApply(cl.boot,rv.boot,1,function(x) bs.mle(n1,n2,x,s,t1,t2,m,theta)))
Put simply, the outer function is sim.one.test() and the inner one is bs.mle().
The error message is "invalid connection". I guess this is because nested parallelism is not supported. Another question on Stack Overflow suggested that I should use mcapply(), which can only be used on Linux, but I am running the program on the Windows platform. Is there any solution for nested parallel computing on Windows? Thanks.
Why do you think you need nested parallelization? That will just increase your parallelization overhead (if it works at all, which I doubt). Conceptually, it's by far better to only parallelize the outer loop (provided it contains enough iterations and is more or less load-balanced).
However, you could use nested foreach loops with a parallel backend. That would convert your nested loops into one loop before sending it to the workers.

OpenACC: How to parallelize function calls

I am working on a project in which I am trying to parallelize the application.
There are some functions which I am trying to parallelize, but the problem is that these functions call other functions very frequently. The loops are only for computation, and there are many loops in one function body.
I know OpenACC does not support function calls (only inlined calls) within its directives, so I have come up with two approaches:
a) just put OpenACC directives around the loops to get the required parallelism and leave the function calls as they are (do this in each and every function body), or
b) pull the called function's body into the calling function so that the overhead of entering the acc directive multiple times is minimized (by including a large number of loops in one block); but this seems like too much of a headache because the function bodies are large (about 4000-5000 lines of code).
I can't figure out how to handle such a scenario.
In summary, I need to find an efficient way to parallelize the function calls in OpenACC.
As Mark Ebersole said, OpenACC 2.0 is the solution: the routine directive in 2.0 allows marking functions as device targets.
