How to run a method in parallel using Julia?

I was reading Parallel Computing docs of Julia, and having never done any parallel coding, I was left wanting a gentler intro. So, I thought of a (probably) simple problem that I couldn't figure out how to code in parallel Julia paradigm.
Let's say I have a matrix/dataframe df from some experiment. Its N rows are variables, and M columns are samples. I have a method pwCorr(..) that calculates the pairwise correlation of rows. If I wanted an NxN matrix of all the pairwise correlations, I'd probably run a for-loop that'd iterate N*N/2 times (upper or lower triangle of the matrix) and fill in the values; however, this seems like a perfect thing to parallelize, since each of the pwCorr() calls is independent of the others. (Am I correct in thinking this way about what can be parallelized, and what cannot?)
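For concreteness, a minimal serial sketch of the loop described above (pwCorr is the question's method; the exact signature used here is an assumption):
N = size(df, 1)
C = zeros(N, N)
for i in 1:N, j in i:N            # fill only the upper triangle
    C[i, j] = pwCorr(df, i, j)    # hypothetical signature: correlate rows i and j
end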
To do this, I feel like I'd have to create a DArray that gets filled by a @parallel for loop. And if so, I'm not sure how this can be achieved in Julia. If that's not the right approach, I guess I don't even know where to begin.

This should work. First you need to propagate the top-level variable data to all the workers:
# Bind `data` as a global on every worker. (Note: this argument order,
# remotecall(pid, f, args...), is the pre-0.5 Julia API; newer versions
# use remotecall(f, pid, args...).)
for pid in workers()
    remotecall(pid, x -> (global data; data = x; nothing), data)
end
then perform the computation in chunks using the DArray constructor with some fancy indexing:
corrs = DArray((20,20)) do I
    # I is the tuple of global index ranges assigned to this chunk.
    out = zeros(length(I[1]), length(I[2]))
    for i = I[1], j = I[2]
        if i < j
            # Upper triangle: leave zero; it mirrors the lower triangle.
            out[i-minimum(I[1])+1, j-minimum(I[2])+1] = 0.0
        else
            # Shift global indices (i, j) to this chunk's local 1-based indices.
            out[i-minimum(I[1])+1, j-minimum(I[2])+1] = cor(vec(data[i,:]), vec(data[j,:]))
        end
    end
    out
end
In more detail, the DArray constructor takes a function that receives a tuple of index ranges and returns the chunk of the resulting matrix corresponding to those index ranges. In the code above, I is the tuple of ranges, with I[1] being the first (row) range. You can see this more clearly with:
julia> DArray((10,10)) do I
           println(I)
           return zeros(length(I[1]), length(I[2]))
       end
From worker 2: (1:10,1:5)
From worker 3: (1:10,6:10)
where you can see it split the array into two chunks on the second axis.
The trickiest part of the example was converting from these 'global' index ranges to local index ranges by subtracting off the minimum element and then adding back 1 for Julia's 1-based indexing.
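As a worked example of that conversion: if a chunk receives the global ranges I = (6:10, 1:5), the global index (7, 3) lands at local position (2, 3):
I = (6:10, 1:5)              # global index ranges handed to one chunk
i, j = 7, 3                  # a global index inside that chunk
li = i - minimum(I[1]) + 1   # local row:    7 - 6 + 1 == 2
lj = j - minimum(I[2]) + 1   # local column: 3 - 1 + 1 == 3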
Hope that helps!

Related

Never-ending 'for' loop prevents my RStudio notebook from being rendered into a .md file

I'm trying to calculate the Kolmogorov-Smirnov statistic in R. I have the following sample, which clearly comes from a random variable that follows a long-tailed distribution.
Download link
https://drive.google.com/file/d/1hIgqikX7p343zdyc-Goq34THUpsZA63n/view?usp=sharing
As you may know, the Kolmogorov-Smirnov statistic requires calculating the empirical cumulative distribution function and the presumed cumulative distribution function. For both calculations I take the following approach: first, I create a vector with the same length as the sample, and then I modify each component of the vector so that it contains the empirical cdf (or presumed cdf) of the corresponding observation of the sample.
For the sake of illustration, I'll show you the code I wrote in order to calculate the empirical cdf.
I'm assuming that the data has been read and stored in a dataframe called data.
ecdf = vector("numeric", length(data$logueos))
for (i in 1:length(data$logueos)) {
    ecdf[i] = sum(data$logueos <= data$logueos[i])/length(data$logueos)
}
The code I wrote for the calculation of the presumed cdf is analogous to the preceding one; the only difference is that I set each component of the pcdf vector equal to $P(X \le t)$, where t is the corresponding observation of the sample, according to the distribution that I'm assuming.
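For illustration only (the question does not say which distribution is assumed), with a hypothetical log-normal fit the presumed-cdf loop might look like:
pcdf = vector("numeric", length(data$logueos))
for (i in 1:length(data$logueos)) {
    # plnorm gives P(X <= t) for a log-normal; mu and sigma are hypothetical fit parameters
    pcdf[i] = plnorm(data$logueos[i], meanlog = mu, sdlog = sigma)
}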
The problem is that this 'for' loop never seems to end. If I force it to stop by clicking RStudio's stop button, it works: the vector stores what I want it to store. But if I press Ctrl+Shift+K in order to render my notebook and preview it, rendering gets stuck when trying to execute the first chunk encountered that contains one of those loops.
First of all, your loop is not endless. It will finish, eventually.
You start by initializing a vector with as many elements as the number of observations (1,245,888, which is a lot of iterations). This vector is full of zeros.
What your loop does is iterate over it, replacing each zero with the result of sum(data$logueos <= data$logueos[i])/length(data$logueos). Notice that when you stop the execution, the first values of your vector are between 0 and 1, while the last values are still 0 (because the loop hasn't got there yet).
So you will just have to wait longer.
To make the execution faster, you could consider parallelizing the loop: standard loops run sequentially, one iteration at a time, while parallelization runs several iterations at once (for example, 4 at a time, depending on your computer's capacity). Here you'll find some information about it: https://nceas.github.io/oss-lessons/parallel-computing-in-r/parallel-computing-in-r.html
Then, my proposal to you:
if (!require(foreach)) { install.packages("foreach") }; library(foreach)
if (!require(doParallel)) { install.packages("doParallel") }; library(doParallel)
registerDoParallel(detectCores() - 1)
# foreach returns one value per iteration; .combine = c collects them
# into a vector. (Assigning into ecdf[i] inside %dopar% would not work,
# because each iteration runs in a separate worker process.)
ecdf <- foreach(i = 1:length(data$logueos), .combine = c) %dopar% {
    sum(data$logueos <= data$logueos[i]) / length(data$logueos)
}
The first two lines will download (if needed) and load the foreach and doParallel libraries, which you need for parallelization.
detectCores() - 1 is going to use all the processors that your computer has except one (to avoid freezing your machine) for computing this loop. You'll see that it's going to be faster!
The registerDoParallel function is what tells foreach how many cores to use.

Collecting results of @parallel for-loop via remotecall

I use the @parallel for macro to run simulations over a range of parameters. Each run produces a 1-dimensional vector. In the end I would like to collect the results in a DataFrame.
Up until now I had always created an intermediate array and reduced the for-loop with vcat, then constructed the DataFrame. I thought it might also work to push! the result of each calculation to the master process via remotecall. A minimal example would look like
X = Float64[]
@sync @parallel for i in linspace(1., 10., 10)
    remotecall_fetch(() -> push!(X, i), 1)
end
The result is consistently an array X with 9, not 10, elements. The number of dropped elements grows as more workers are added.
This is on julia-0.6.1.
I thought I had understood julia's parallel computing structure, but it seems not.
What is the reason for this behavior? And how can I do it better and safely?
I suspect you're triggering a race condition, though I couldn't say exactly where.
If you only need to return one value per iteration, I would suggest just using pmap:
pmap(linspace(1., 10., 10)) do i
    i
end
otherwise, if each iteration can return multiple values, it would probably be best to use a RemoteChannel.
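A minimal sketch of the RemoteChannel approach, using the Julia 0.6 API (the buffer size and element type here are assumptions):
results = RemoteChannel(() -> Channel{Float64}(10))   # channel lives on the master process
@sync @parallel for i in linspace(1., 10., 10)
    put!(results, i)    # workers put values; the channel serializes access
end
X = [take!(results) for _ in 1:10]
Unlike the push! version, every value goes through a single channel owned by process 1, so nothing is lost.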

Faster way of testing a condition in MATLAB

I need to run many, many tests of the form v<0, where v is a vector (a relatively short one). I am currently doing it with
all(v<0)
Is there a faster way?
Not sure which one will be faster (that may depend on the machine and Matlab version), but here are some alternatives to all(v<0):
~any(v>=0)
nnz(v>=0)==0 %// Or ~nnz(v>=0)
sum(v>=0)==0 %// Or ~sum(v>=0)
isempty(find(v>=0, 1)) %// Or isempty(find(v>=0))
I think the issue is that the comparison is evaluated for all elements of the array first, and only then is the condition tested. That is, for the test any(v<0), I believe Matlab does the following:
Step 1: compute v<0 for every element of v.
Step 2: search through the results of step 1 for a true value.
So even if the first element of v is less than zero, the comparison is first computed for all elements, hence wasting a lot of time. I think this is also true for any of the alternative solutions offered above.
I don't know of a faster way to do it easily, but wish I did. In some cases, breaking the array v up into smaller chunks and testing incrementally could speed things up, particularly if the condition is common. For example:
function result = anyLessThanZero(v)
    % Returns true as soon as a negative element is found.
    w = v(:);
    result = true;
    for i = 1:numel(w)
        if ( w(i) < 0 )
            return;   % result is still true here
        end
    end
    result = false;
end
but that can be very inefficient if the condition is rare. (If you were to really do this, there is probably a better way than I illustrate above to handle any condition, not just <0, but I show it this way to make it clear; one such generalization is sketched below.)
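For instance, a generalized sketch that takes an arbitrary elementwise predicate as a function handle (this helper is an illustration, not part of the original answer):
function result = anyMatch(v, pred)
    % Returns true as soon as pred(element) is true for some element of v.
    w = v(:);
    result = true;
    for i = 1:numel(w)
        if pred(w(i))
            return;   % result is still true here
        end
    end
    result = false;
end
With this, all(v<0) fails exactly when anyMatch(v, @(x) x >= 0) returns true.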

foreach loop is not working (parallelization)

If I want to speed up the following code, how can I do that?
pcg <- foreach(boot.iter=1:boot.rep) %dopar% {
    d.boot <- d[in.sample[[boot.iter]],]
    # ... rest of the per-bootstrap computation ...
}
(Here in.sample[[boot.iter]] randomly generates 1000 row numbers.)
I planned to split the overall task and send the separate trials to each core. For example,
sub_task <- foreach(i=1:cores.use) %dopar% {
    for (j in 1:trialsPerCore) {
        d.boot <- d[in.sample[[structure[i,j]]],]
    }
}
(structure is a matrix which contains the numbers 1 to boot.rep.)
But this one does not work; it seems like we cannot use a 'for' loop inside foreach? Also, d.boot only keeps the last iteration on each core.
I tried to search online, and I found that the following code works:
sub_task <- foreach(i=1:cores.use) %:%
    foreach(j=1:trialsPerCore) %dopar% {
        d.boot <- d[in.sample[[structure[i,j]]],]
    }
But I think it is similar to my original function, and I do not think there is a great enhancement.
Do you guys have any suggestions?
Unless I'm missing something, it doesn't look like you're doing much if any computation in your foreach loop. You appear to be simply creating a list of matrices from d. That wouldn't benefit from parallel computing unless you can perform an operation on those matrices in your loop, and ideally return a relatively small result from that operation.
Although "chunking" often helps to execute parallel loops more efficiently, I don't think it's going to help here. The communication may be a little more efficient, but you're still just doing a lot of communication and essentially no computation.
Note that your attempt at chunking doesn't work because the for loop in the foreach loop repeatedly assigns a matrix to the same variable, and the for loop itself returns NULL as the value of the foreach body, so sub_task is a list of NULLs. An lapply would work much better in this context, as sketched below.
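For example, a sketch of the chunked version with lapply in place of the inner for loop (assuming cores.use, trialsPerCore, structure, and in.sample as defined in the question), so each worker returns its whole list of matrices:
sub_task <- foreach(i = 1:cores.use) %dopar% {
    # lapply returns the list of matrices for this worker's chunk of trials
    lapply(1:trialsPerCore, function(j) d[in.sample[[structure[i, j]]], ])
}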
It will help a little to compute the values of the in.sample list inside the foreach loop. That will decrease the amount of data that is auto-exported to each of the workers, at the cost of a bit more computation on the workers, which is generally what you want to do in parallel loops. At the very least, you could iterate over in.sample directly:
pcg <- foreach(i=in.sample) %dopar% d[i,]
In this form, it's all the more obvious that there isn't enough computation to warrant parallel computing. If there isn't any real computation to perform, you're better off using lapply:
pcg <- lapply(in.sample, function(i) d[i,])

Vectorization of matlab code

I'm kinda new to vectorization. I have tried myself but couldn't manage it. Can somebody help me vectorize this code, as well as give a short explanation of how you do it, so that I can adapt the thinking process too? Thanks.
function [result] = newHitTest (point,Polygon,r,tol,stepSize)
%This function calculates whether a point is allowed.
%First a quick test is done by calculating the distance from point to
%each point of the polygon. If that distance is smaller than range "r",
%the point is not allowed. This will slow down the algorithm at some
%points, but will greatly speed it up in others because fewer calls to the
%circleTest routine are needed.
polySize = size(Polygon,1);
testCounter = 0;
for i = 1:polySize
    d = sqrt(sum((Polygon(i,:)-point).^2));
    if d < tol*r
        testCounter = 1;
        break
    end
end
if testCounter == 0
    circleTestResult = circleTest (point,Polygon,r,tol,stepSize);
    testCounter = circleTestResult;
end
result = testCounter;
Given the information that Polygon is 2 dimensional, point is a row vector and the other variables are scalars, here is the first version of your new function (scroll down to see that there are lots of ways to skin this cat):
function [result] = newHitTest (point,Polygon,r,tol,stepSize)
result = 0;
linDiff = Polygon-repmat(point,size(Polygon,1),1);
testLogicals = sqrt( sum( ( linDiff ).^2 ,2 )) < tol*r;
if any(testLogicals); result = circleTest (point,Polygon,r,tol,stepSize); end
The thought process for vectorization in Matlab involves trying to operate on as much data as possible using a single command. Most of the basic builtin Matlab functions operate very efficiently on multi-dimensional data. Using a for loop is the reverse of this, as you are breaking your data down into smaller segments for processing, each of which must be interpreted individually. By resorting to data decomposition using for loops, you potentially lose some of the massive performance benefits associated with the highly optimised code behind the Matlab builtin functions.
The first thing to think about in your example is the conditional break in your main loop. You cannot break from a vectorized process. Instead, calculate all possibilities, make an array of the outcome for each row of your data, then use the any keyword to see if any of your rows have signalled that the circleTest function should be called.
NOTE: It is not easy to efficiently break out of a calculation early in Matlab. However, as you are just computing a form of Euclidean distance in the loop, you'll probably see a performance boost by using the vectorized version and calculating all possibilities. If the computation in your loop were more expensive, the input data were large, and you wanted to break out as soon as you hit a certain condition, then a Matlab extension written in a compiled language (a MEX function) could potentially be much faster than a vectorized version that performs needless calculation. However, this assumes you can write code that matches the performance of the Matlab builtins in a language that compiles to native code.
Back on topic ...
The first thing to do is to take the linear difference (linDiff in the code example) between Polygon and your row vector point. To do this in a vectorized manner, the dimensions of the 2 variables must be identical. One way to achieve this is to use repmat to copy each row of point to make it the same size as Polygon. However, bsxfun is usually a superior alternative to repmat (as described in this recent SO question), making the code ...
function [result] = newHitTest (point,Polygon,r,tol,stepSize)
result = 0;
linDiff = bsxfun(@minus, Polygon, point);
testLogicals = sqrt( sum( ( linDiff ).^2 ,2 )) < tol*r;
if any(testLogicals); result = circleTest (point,Polygon,r,tol,stepSize); end
I rolled your per-iteration scalar d into a column vector of distances by summing across the 2nd axis (note the removal of the array index from Polygon and the addition of ,2 in the sum command). I then went further and evaluated the logical array testLogicals inline with the calculation of the distance measure. You will quickly see that a downside of heavy vectorisation is that it can make the code less readable to those not familiar with Matlab, but the performance gains are worth it. Comments are pretty necessary.
Now, if you want to go completely crazy, you could argue that the test function is so simple now that it warrants the use of an 'anonymous function' or 'lambda' rather than a complete function definition. The test for whether or not it is worth doing the circleTest does not require the stepSize argument either, which is another reason for using an anonymous function. You can roll your test into an anonymous function and then just use circleTest in your calling script, making the code self-documenting to some extent...
doCircleTest = @(point,Polygon,r,tol) any(sqrt( sum( bsxfun(@minus, Polygon, point).^2, 2 )) < tol*r);
if doCircleTest(point,Polygon,r,tol)
    result = circleTest (point,Polygon,r,tol,stepSize);
else
    result = 0;
end
Now everything is vectorised, the use of function handles gives me another idea . . .
If you plan on performing this test at multiple points in the code, the repetition of the if statements would get a bit ugly. To stay DRY (don't repeat yourself), it seems sensible to put the test and the conditional function call into a single function, just as you did in your original post. However, the utility of that function would be very narrow: it would only test whether the circleTest function should be executed, and then execute it if need be.
Now imagine that after a while you have some other conditional functions, just like circleTest, each with its own equivalent of doCircleTest. It would be nice to reuse the conditional switching code. For this, make a function like your original that takes a default value, the boolean result of the computationally cheap test function, and the function handle of the expensive conditional function with its associated arguments...
function result = conditionalFun( default, cheapFunResult, expensiveFun, varargin )
    if cheapFunResult
        result = expensiveFun(varargin{:});
    else
        result = default;
    end
end %//of function
You could call this function from your main script with the following...
result = conditionalFun(0, doCircleTest(point,Polygon,r,tol), @circleTest, point,Polygon,r,tol,stepSize);
...and the beauty of it is that you can use any test, default value, and expensive function. Perhaps a little overkill for this simple example, but this is where my mind wandered when I brought up the idea of using function handles.
