Collecting results of @parallel for-loop via remotecall - parallel-processing

I use the @parallel for macro to run simulations over a range of parameters. Each run produces a 1-dimensional vector. In the end I would like to collect the results in a DataFrame.
Up until now I had always created an intermediate array, reduced the for-loop with vcat, and then constructed the DataFrame. I thought it might also work to push! the result of each calculation to the master process via remotecall. A minimal example would look like
X = Float64[]
@sync @parallel for i in linspace(1.,10.,10)
    remotecall_fetch(() -> push!(X, i), 1)
end
The result is consistently an array X with 9, not 10, elements. The number of dropped elements grows as more workers are added.
This is on julia-0.6.1.
I thought I had understood Julia's parallel computing model, but apparently not.
What is the reason for this behavior? And how can I do it better and safely?

I suspect you're triggering a race condition, though I couldn't say where.
If you only need to return one value per iteration, I would suggest just using pmap:
pmap(linspace(1.,10.,10)) do i
    i
end
Otherwise, if each iteration can return multiple values, it would probably be best to use RemoteChannels.
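A minimal sketch of that idea on Julia 0.6 (the channel's element type and the per-iteration result vector here are illustrative assumptions, not taken from the original code):
# each iteration put!s its result vector into a RemoteChannel owned by process 1;
# the master process then drains the channel after the loop finishes
results = RemoteChannel(() -> Channel{Vector{Float64}}(10))
@sync @parallel for i in linspace(1., 10., 10)
    put!(results, [i, 2i])    # stand-in for the real per-parameter result vector
end
collected = [take!(results) for k in 1:10]
The collected vectors can then be hcat'ed together or pushed into a DataFrame, much like the original vcat-based approach.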

Related

Never ending 'for' loop prevents my RStudio notebook from being rendered into a .md file

I'm trying to calculate the Kolmogorov-Smirnov statistic in R. I have the following sample, which clearly comes from a random variable that follows a long-tailed distribution.
Download link
https://drive.google.com/file/d/1hIgqikX7p343zdyc-Goq34THUpsZA63n/view?usp=sharing
As you may know, the Kolmogorov-Smirnov statistic requires the calculation of the empirical cumulative distribution function and the presumed cumulative distribution function. For both calculations I take the following approach: first, I create a vector with the same length as the sample, and then I set each component of the vector to the empirical cdf (or presumed cdf) of the corresponding observation of the sample.
For the sake of illustration, I'll show you the code I wrote in order to calculate the empirical cdf.
I'm assuming that the data has been read and stored in a dataframe called data.
ecdf = vector("numeric", length(data$logueos))
for (i in 1:length(data$logueos)) {
  ecdf[i] = sum(data$logueos <= data$logueos[i])/length(data$logueos)
}
The code I wrote for the calculation of the presumed cdf is analogous to the preceding one; the only difference is that I set each component of the pcdf vector equal to $P(X \le t)$, where t is the corresponding observation of the sample, according to the distribution that I'm assuming.
The problem is that this 'for' loop never seems to end. If I force it to stop by clicking RStudio's stop button it works: the vector stores what I want it to store. But if I press Ctrl+Shift+K to render and preview my notebook, rendering gets stuck on the first chunk that contains one of those loops.
First of all, your loop is not endless. It will finish, eventually.
You start by initializing a vector with as many elements as there are observations (1,245,888, which is a lot of iterations). This vector is full of zeros.
What your loop does is iterate over that vector, replacing each zero with the value sum(data$logueos <= data$logueos[i])/length(data$logueos). Notice that if you stop the execution partway through, the first values of the vector are between 0 and 1 while the last values are still 0 (because the loop hasn't got there yet).
So you will simply have to wait longer.
To make the execution faster, you could consider parallelizing the loop: a standard loop runs sequentially, one iteration at a time, whereas a parallel loop runs, say, 4 iterations at once, depending on your machine's capacity. Here you'll find some information about it: https://nceas.github.io/oss-lessons/parallel-computing-in-r/parallel-computing-in-r.html
So here is my proposal:
# foreach provides the loop construct; doParallel provides registerDoParallel() and detectCores()
if (!require(foreach))    { install.packages("foreach") };    library(foreach)
if (!require(doParallel)) { install.packages("doParallel") }; library(doParallel)
registerDoParallel(detectCores() - 1)   # leave one core free so the machine stays responsive

n <- length(data$logueos)
# %dopar% runs the iterations in parallel; workers cannot write into a vector that
# lives in the master session, so the results are collected with .combine = "c"
ecdf <- foreach(i = 1:n, .combine = "c") %dopar% {
  sum(data$logueos <= data$logueos[i]) / n
}
The first lines download and load the foreach and doParallel libraries, which you need for parallelization.
detectCores() - 1 will use all the processors your computer has except one (to avoid freezing your machine) for computing this loop. You'll see that it's going to be faster!
The registerDoParallel function is what tells foreach how many cores to use.

Data management in a parallel for-loop in Julia

I'm trying to do some statistical analysis using Julia. The code consists of the files script.jl (e.g. initialisation of the data) and algorithm.jl.
The number of simulations is large (at least 100,000) so it makes sense to use parallel processing.
The code below is just some pseudocode to illustrate my question —
function script(simulations::Int64)
    # initialise input data
    ...
    # initialise other variables for statistical analysis using zeros()
    ...
    require("algorithm.jl")
    @parallel for z = 1:simulations
        while true
            choices = algorithm(data);
            if length(choices) == 0
                break
            else
                # process choices and pick one (which alters the data)
                ...
            end
        end
    end
    # display results of statistical analysis
    ...
end
and
function algorithm(data)
    # actual algorithm
    ...
    return choices;
end
For example, I would like to know how many choices there are on average, what the most common choice is, and so on. For this purpose I need to save some data from choices (in the for-loop) to the statistical analysis variables (initialised before the for-loop) and display the results (after the for-loop).
I've read about using @spawn and fetch() and functions like pmap(), but I'm not sure how I should proceed. Just using the variables inside the for-loop does not work, as each proc gets its own copy, so the values of the statistical analysis variables after the for-loop will just be zeros.
[Edit] In Julia I use include("script.jl") and script(100000) to run the simulations; there are no issues when using a single proc. However, when using multiple procs (e.g. after addprocs(3)) all statistical variables are zero after the for-loop, which is to be expected.
It seems that you want to parallelize an inherently serial operation, because each iteration depends on the result of another one (in this case through data).
I think if you could restructure the above code like this:
@parallel (dosomethingwithdata) for z = 1:simulations
    while true
        choices = algorithm(data, z);
        if length(choices) == 0
            break
        else
            # process choices and pick one (which alters the data)
            ...
        end
    end
    data    # the value each iteration hands to the reducer
end
then you may find a parallel solution for the problem.
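For instance, if each simulation can in fact run on its own copy of the data, the reducer form can collect a summary statistic directly. A minimal sketch, where run_simulation is a hypothetical wrapper around the inner while loop that returns the number of choices made in one run:
# each iteration returns one number; the (+) reducer sums them across workers
total_choices = @parallel (+) for z = 1:simulations
    run_simulation(copy(data))
end
println("average number of choices: ", total_choices / simulations)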

How to run a method in parallel using Julia?

I was reading Parallel Computing docs of Julia, and having never done any parallel coding, I was left wanting a gentler intro. So, I thought of a (probably) simple problem that I couldn't figure out how to code in parallel Julia paradigm.
Let's say I have a matrix/dataframe df from some experiment. Its N rows are variables, and M columns are samples. I have a method pwCorr(..) that calculates pairwise correlation of rows. If I wanted an NxN matrix of all the pairwise correlations, I'd probably run a for-loop that'd iterate N*N/2 times (the upper or lower triangle of the matrix) and fill in the values; however, this seems like a perfect thing to parallelize since each of the pwCorr() calls is independent of the others. (Am I correct in thinking this way about what can be parallelized, and what cannot?)
To do this, I feel like I'd have to create a DArray that gets filled by a @parallel for loop. And if so, I'm not sure how this can be achieved in Julia. If that's not the right approach, I guess I don't even know where to begin.
This should work. First you need to propagate the top-level variable (data) to all the workers:
for pid in workers()
    remotecall(x -> (global data; data = x; nothing), pid, data)   # define a global data binding on each worker
end
then perform the computation in chunks using the DArray constructor with some fancy indexing:
corrs = DArray((20,20)) do I
    out = zeros(length(I[1]), length(I[2]))
    for i = I[1], j = I[2]
        if i < j
            out[i-minimum(I[1])+1, j-minimum(I[2])+1] = 0.0
        else
            out[i-minimum(I[1])+1, j-minimum(I[2])+1] = cor(vec(data[i,:]), vec(data[j,:]))
        end
    end
    out
end
In more detail, the DArray constructor takes a function which takes a tuple of index ranges and returns a chunk of the resulting matrix which corresponds to those index ranges. In the code above, I is the tuple of ranges with I[1] being the first range. You can see this more clearly with:
julia> DArray((10,10)) do I
           println(I)
           return zeros(length(I[1]), length(I[2]))
       end
        From worker 2:  (1:10,1:5)
        From worker 3:  (1:10,6:10)
where you can see it split the array into two chunks on the second axis.
The trickiest part of the example was converting from these 'global' index ranges to local index ranges by subtracting off the minimum element and then adding back 1 for Julia's 1-based indexing.
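Once the DArray is built, the full matrix can be gathered back onto the master process if you need it there; a small usage sketch (note that on recent Julia versions the DArray type lives in the DistributedArrays.jl package):
corrs_local = convert(Array, corrs)    # collect all chunks onto the master process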
Hope that helps!

foreach loop is not working (parallelization)

If I want to speed up the following code, How can I do that?
pcg <- foreach(boot.iter=1:boot.rep) %dopar% {
  d.boot <- d[in.sample[[boot.iter]], ]
}
(here in.sample[[boot.iter]] randomly generates 1000 row numbers)
I planned to split the overall task and send separate batches of trials to each core, for example:
sub_task <- foreach(i=1:cores.use) %dopar% {
  for (j in 1:trialsPerCore) {
    d.boot <- d[in.sample[[structure[i,j]]], ]
  }
}
(structure is a matrix which contains the numbers 1 to boot.rep)
But this does not work; it seems we cannot use a "for" loop inside foreach? Also, d.boot only keeps the last iteration on each core.
I searched online and found that the following code works:
sub_task <- foreach(i=1:cores.use) %:%
  foreach(j=1:trialsPerCore) %dopar% {
    d.boot <- d[in.sample[[structure[i,j]]], ]
  }
But I think it is essentially the same as my original approach, and I do not see a great improvement.
Do you guys have any suggestions?
Unless I'm missing something, it doesn't look like you're doing much if any computation in your foreach loop. You appear to be simply creating a list of matrices from d. That wouldn't benefit from parallel computing unless you can perform an operation on those matrices in your loop, and ideally return a relatively small result from that operation.
Although "chunking" often helps to execute parallel loops more efficiently, I don't think it's going to help here. The communication may be a little more efficient, but you're still just doing a lot of communication and essentially no computation.
Note that your attempt at chunking doesn't work because the for loop in the foreach loop is repeatedly assigning a matrix to the same variable. Then, the for loop itself returns a NULL as the body of the foreach loop, so that sub_task is a list of NULL's. An lapply would work much better in this context.
It will help a little to compute the values in the in.sample list in the foreach loop. That will decrease the amount of data that is auto-exported to each of the workers at the cost of a bit more computation on the workers, which is generally what you want to do in parallel loops. At the very least, you could iterate over in.sample directly:
pcg <- foreach(i=in.sample) %dopar% d[i,]
In this form, it's all the more obvious that there isn't enough computation to warrant parallel computing. If there isn't any real computation to perform, you're better off using lapply:
pcg <- lapply(in.sample, function(i) d[i,])

Parallelizing an algorithm with many exit points?

I'm faced with parallelizing an algorithm which in its serial implementation examines the six faces of a cube of array locations within a much larger three-dimensional array. (That is, select an array element, and then define a cube or cuboid around that element 'n' elements distant in x, y, and z, bounded by the bounds of the array.)
Each work unit looks something like this (Fortran pseudocode; the serial algorithm is in Fortran):
do n1 = nlo, nhi
  do o1 = olo, ohi
    if (somecondition(n1,o1)) then
      retval = .TRUE.
      RETURN
    endif
  end do
end do
Or C pseudocode:
for (n1 = nlo; n1 <= nhi; n1++) {
  for (o1 = olo; o1 <= ohi; o1++) {
    if (somecondition(n1,o1) != 0) {
      return true;
    }
  }
}
There are six work units like this in the total algorithm, where the 'lo' and 'hi' values generally range between 10 and 300.
What I think would be best would be to schedule six or more threads of execution, round-robin if there aren't that many CPU cores, ideally with the loops executing in parallel, with the same goal as the serial algorithm: once somecondition() becomes true, execution among all the threads must stop immediately and a value of true be set in a shared location.
What techniques exist in a Windows compiler to facilitate parallelizing tasks like this? Obviously, I need a master thread which waits on a semaphore or the completion of the worker threads, so there is a need for nesting and signaling, but my experience with OpenMP is introductory at this point.
Are there message passing mechanisms in OpenMP?
EDIT: If the highest difference between "nlo" and "nhi" or "olo" and "ohi" is eight to ten, that would imply no more than 64 to 100 iterations for this nested loop, and no more than 384 to 600 iterations for the six work units together. Based on that, is it worth parallelizing at all?
Would it be better to parallelize the loop over the array elements and leave this algorithm serial, with multiple threads running the algorithm on different array elements? I'm suggesting this based on your comment "The time consumption comes from the fact that every element in the array must be tested like this. The arrays commonly have between four million and twenty million elements." Parallelizing over the array elements is also flexible in terms of the number of threads. Unless there is a reason the array elements have to be checked in some order?
It seems that the portion you are showing us doesn't take that long to execute, so making it take less clock time by running it in parallel might not be easy ... there is always some overhead to multiple threads, and if there is not much time to gain, parallel code might not be faster.
One possibility is to use OpenMP to parallelize over the 6 loops -- declare logical :: array(6), allow each loop to run to completion, and then retval = any(array). Then you can check this value and return outside the parallelized loop. Add a schedule(dynamic) to the parallel do statement if you do this. Or, have a separate !$omp parallel and then put !$omp do schedule(dynamic) ... !$omp end do nowait around each of the 6 loops.
Or, you can follow the good advice by @M.S.B. and parallelize the outermost loop over the whole array. The problem here is that you cannot have a RETURN inside a parallel loop -- so label the second outermost loop (the largest one within the parallel part), and EXIT that loop -- something like
retval = .FALSE.
!$omp parallel do default(private) shared(BIGARRAY,retval) schedule(dynamic,1)
do k = 1, NN
  if (.not. retval) then
    outer2: do j = 1, NN
      do i = 1, NN
        ! --- your loop #1
        do n1 = nlo, nhi
          do o1 = olo, ohi
            if (somecondition(BIGARRAY(i,j,k),n1,o1)) then
              retval = .TRUE.
              exit outer2
            endif
          end do
        end do
        ! --- your loops #2 ... #6 go here
      end do
    end do outer2
  end if
end do
!$omp end parallel do
[edit: the if statement is there presuming that you need to find out if there is at least one element like that in the big array. If you need to figure the condition for every element, you can similarly either add a dummy loop exit or goto, skipping the rest of the processing for that element. Again, use schedule(dynamic) or schedule(guided).]
As a separate point, you might also want to check if it may be a good idea to go through the innermost loop by some larger step (depending on float size), compute a vector of logicals on each iteration and then aggregate the results, e.g. something like if(count(somecondition(x(o1:o1+step,n1,k)))>0); in this case the compiler may be able to vectorize somecondition.
I believe you can do what you want with the task construct introduced in OpenMP 3; Intel Fortran supports tasking in OpenMP. I don't use tasks often so I won't offer you any wonky pseudocode.
You already mentioned the obvious way to stop all threads as soon as any thread finds the ending condition: have each check some shared variable which gives the status of the ending condition, thereby determining whether to break out of the loops. Obviously this is an overhead, so if you decide to take this approach I would suggest a few things:
Use atomics to check the ending condition; this avoids expensive memory flushing, as only the variable in question is flushed. Move to OpenMP 3.1; it supports some new atomic operations.
Check infrequently, maybe like once per outer iteration. You should only be parallelizing large cases to overcome the overhead of multithreading.
This one is optional, but you can try adding compiler hints, e.g. if you expect a certain condition to be false most of the time, the compiler will optimize the code accordingly.
Another (somewhat dirty) approach is to use shared variables for the loop ranges for each thread, maybe use a shared array where index n is for thread n. When one thread finds the ending condition, it changes the loop ranges of all the other threads so that they stop. You'll need the appropriate memory synchronization. Basically the overhead has now moved from checking a dummy variable to synchronizing/checking loop conditions. Again probably not so good to do this frequently, so maybe use shared outer loop variables and private inner loop variables.
On another note, this reminds me of the classic polling versus interrupt problem. Unfortunately I don't think OpenMP supports interrupts where you can send some kind of kill signal to each thread.
There are hacking work-arounds like using a child process for just this parallel work and invoking the operating system scheduler to emulate interrupts, however this is rather tricky to get correct and would make your code extremely unportable.
Update in response to comment:
Try something like this:
char shared_var = 0;
#pragma omp parallel
{
    char private_var = 0;
    // you should have some method for setting loop ranges for each thread
    for (n1 = nlo; n1 <= nhi; n1++) {
        for (o1 = olo; o1 <= ohi; o1++) {
            if (somecondition(n1,o1) != 0) {
                #pragma omp atomic write
                shared_var = 1; // done marker, this will also trigger the other break below
                break;          // could instead use goto to break out of both loops in 1 go
            }
        }
        #pragma omp atomic read
        private_var = shared_var;
        if (private_var != 0) break;
    }
}
A suitable parallel approach might be to let each worker examine a part of the overall problem, exactly as in the serial case, and use a local (non-shared) variable for the result (retval). Finally, do a reduction over all workers from these local variables into a shared overall result.
