Concurrency of Julia parallel computing

I am new to Julia and studying Julia parallel computing recently.
I am still not clear about the exact mechanism of Julia's parallelism, including the macros @sync and @async, even after reading the relevant documentation.
The following is the pmap function from the Julia v0.5 documentation:
function pmap(f, lst)
    np = nprocs()  # determine the number of processes available
    n = length(lst)
    results = Vector{Any}(n)
    i = 1
    # function to produce the next work item from the queue.
    # in this case it's just an index.
    nextidx() = (idx=i; i+=1; idx)
    @sync begin
        for p=1:np
            if p != myid() || np == 1
                @async begin
                    while true
                        idx = nextidx()
                        if idx > n
                            break
                        end
                        results[idx] = remotecall_fetch(f, p, lst[idx])
                    end
                end
            end
        end
    end
    results
end
Is it possible for two different processes/workers to call nextidx() at the same time and get the same idx = j? If so, I feel results[j] would be computed twice and results[j+1] would never be computed.
Thanks very much.
More findings:
function f1()
    i = 1
    nextidx() = (idx=i; sleep(1); i+=1; idx)
    for p = 1:2
        @async begin
            idx = nextidx()
            println(idx)
        end
    end
end
f1()
The result is 1 1.
Through this I find that the time periods during which the two tasks call nextidx() can overlap. So I feel that in the first code, if np = 3 (i.e., two workers) and the length n of lst is very large, say 10^8, it is possible for two tasks to get the same index. It could happen just by a coincidence in timing, i.e., the two tasks evaluate the expression idx = i at almost the same moment, so the code is not stable.
Am I right?

No, as long as the jobs are scheduled correctly. The pmap documentation points out that the master process schedules the jobs serially; only the parallel execution is asynchronous. The key is that all of the @async tasks run on the single master process, where they are cooperatively scheduled: a task switch can only happen at a yield point such as sleep, I/O, or a remote call. The original nextidx() contains no yield point, so it runs atomically with respect to the other tasks and requires no locks; only the remotecall_fetch that follows it yields. Adding sleep inside nextidx deliberately inserts a yield point between reading i and incrementing it, which breaks this property and introduces the race condition that you observe. Since all of the tasks share the master's state -- which is the purpose of the nextidx closure -- the master process will not schedule the same job twice.
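To make the cooperative-scheduling point concrete, here is a minimal, illustrative sketch (the function name is mine, not from the original code): because a task switch can only happen at a yield point, a counter closure with no yield point inside behaves atomically across @async tasks on one process:

    # Task switches happen only at yield points (sleep, I/O, remote calls).
    function safe_counter()
        i = 1
        nextidx() = (idx = i; i += 1; idx)   # no yield point: effectively atomic
        @sync begin
            for p = 1:4
                @async println(nextidx())    # prints 1, 2, 3, 4 -- never a duplicate
            end
        end
    end
    safe_counter()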


Temporary shared parallelism (OMP) within a distributed environment (MPI)

Consider the following scenario:
We use n MPI processes to evaluate chunks of a large matrix A. Each chunk is evaluated by a unique process. The number of OMP threads is 1.
We gather these chunks in the master process (rank=0) and build the full matrix A.
The master process calls a function x = f(A) and broadcasts the result x to all processes.
The issue is that f(A) takes a long time, and other processes have to wait.
If the function f can be accelerated with OMP parallelism, does it make sense to set the number of OMP threads (on the master process) equal to n before calling f(A) and changing it to 1 afterwards?
In other words, what can be wrong with the following pseudo-code?
mpirun -n n ./exe # command for running the code
where the pseudo-code for exe looks something like
a = g(mpi_rank, ... )         # process-specific matrix
A = mpi_gather(a, mpi_rank=0) # gather all
if mpi_rank == 0
    set_num_omp_threads(n)
    x = f(A)
    set_num_omp_threads(1)
else
    x = 0
mpi_bcast(x, mpi_rank=0)
What are the possible performance issues/pitfalls?
EDIT: The original code is too complicated, but I have replicated the issue via the following code.
# run by: mpirun -n 6 python test.py
import time
import torch
import torch.distributed as dist

def get_weights(A, Y, use_omp=False):
    # maximize OMP threads
    master = dist.get_rank() == 0
    if use_omp and master:
        ws = dist.get_world_size()
        th = torch.get_num_threads()
        torch.set_num_threads(ws*th)
    # actual calculation; only master
    if dist.get_rank() == 0:
        Q, R = torch.qr(A)
        W = (R.inverse() @ Q.t() @ Y)
        print(torch.get_num_threads())
    else:
        W = torch.zeros(A.shape[1], 1)
    dist.broadcast(W, 0)
    # reset OMP threads
    if use_omp and master:
        torch.set_num_threads(th)
    return W

def test(n=100000, m=300, s=10, use_omp=False):
    # ...
    # normal distributed code ...
    # ...
    A = torch.rand(n, m)
    _W = torch.rand(m, 1)
    Y = A @ _W
    W = get_weights(A, Y, use_omp=use_omp)  # <- this
    res = W.allclose(_W, atol=1e-3)
    return res

def timeit(repeat=10, use_omp=False):
    t1 = time.time()
    for _ in range(repeat):
        test(use_omp=use_omp)
    t2 = time.time()
    return (t2-t1)/repeat

if __name__ == '__main__':
    dist.init_process_group('mpi')
    print(timeit(use_omp=False))  # -> 3.036880087852478
    print(timeit(use_omp=True))   # -> 1.3959508180618285
In the above example, setting threads to 6 improved the speed by a factor of ~2. But when I try something similar in my actual code (and on a much larger cluster), it becomes almost 2 times slower!
EDIT2: The essence of this question is that (on a single node with n cores) most of the calculation (70%) is optimal with MPI using n processes, while the other part (30%) is optimal with n OMP threads. I was wondering about the optimal utilization of the available cores in both parts.
Although the comments were very helpful, I guess there are no easy answers. In this particular case the region in question is a linear algebra problem, and using ScaLAPACK is probably the best solution.
But the question stands for general cases.
From what I understand of your question, you want to launch the OpenMP threads for the computation and then shut down the OpenMP implementation. With your pseudo-code example, it would look like this:
a = g(mpi_rank, ... )         # process-specific matrix
A = mpi_gather(a, mpi_rank=0) # gather all
if mpi_rank == 0
    x = f(A) ! do a parallel region in f()
    omp_pause_resource_all(omp_pause_hard)
else
    x = 0
mpi_bcast(x, mpi_rank=0)
See the functions omp_pause_resource_all() and omp_pause_resource() of the OpenMP API specification. They terminate most of the OpenMP implementation, such that only a very small part that should not affect performance will remain.
Note, the functions were introduced with OpenMP API version 5.0, so you will need a fairly new OpenMP compiler for this to work.
Arrange your MPI processes so that node 0 has only one process, and all other nodes have one process per core. Then process zero can easily launch an OpenMP (or otherwise threaded) shared-memory parallel section.
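For example, one possible way to get that layout with an Open MPI-style hostfile; the host names and slot counts here are placeholders, and other MPI launchers have equivalent options:

    # hostfile "hosts": the master alone on node0, one rank per core elsewhere
    #   node0 slots=1
    #   node1 slots=16
    #   node2 slots=16
    mpirun --hostfile hosts ./exe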

Why is this version of a matrix copy so slow?

I've noticed a strange behavior of Julia during a matrix copy.
Consider the following three functions:
function priv_memcopyBtoA!(A::Matrix{Int}, B::Matrix{Int}, n::Int)
    A[1:n,1:n] = B[1:n,1:n]
    return nothing
end

function priv_memcopyBtoA2!(A::Matrix{Int}, B::Matrix{Int}, n::Int)
    ii = 1; jj = 1;
    while ii <= n
        jj = 1 #(*)
        while jj <= n
            A[jj,ii] = B[jj,ii]
            jj += 1
        end
        ii += 1
    end
    return nothing
end

function priv_memcopyBtoA3!(A::Matrix{Int}, B::Matrix{Int}, n::Int)
    A[1:n,1:n] = view(B, 1:n, 1:n)
    return nothing
end
Edit: 1) I tested whether the code would throw a BoundsError, so the line marked with jj = 1 #(*) was missing in the initial code. The testing results were already from the fixed version, so they remain unchanged. 2) I've added the view variant; thanks to @Colin T Bowers for addressing both issues.
It seems like these functions should lead to more or less the same code. Yet I get for
A = fill!(Matrix{Int32}(2^12,2^12),2); B = Int32.(eye(2^12));
the results
@timev priv_memcopyBtoA!(A,B, 2000)
0.178327 seconds (10 allocations: 15.259 MiB, 85.52% gc time)
elapsed time (ns): 178326537
gc time (ns): 152511699
bytes allocated: 16000304
pool allocs: 9
malloc() calls: 1
GC pauses: 1
and
@timev priv_memcopyBtoA2!(A,B, 2000)
0.015760 seconds (4 allocations: 160 bytes)
elapsed time (ns): 15759742
bytes allocated: 160
pool allocs: 4
and
@timev priv_memcopyBtoA3!(A,B, 2000)
0.043771 seconds (7 allocations: 224 bytes)
elapsed time (ns): 43770978
bytes allocated: 224
pool allocs: 7
That's a drastic difference, and it's surprising. I expected the first version to behave like memcpy, which is hard to beat for a large memory block.
The second version has overhead from the pointer arithmetic (getindex), the branch condition (<=) and the bounds check in each assignment. Yet each assignment takes just ~3 ns.
Also, the time the garbage collector consumes varies a lot for the first function. If no garbage collection is performed, the large difference becomes small, but it remains: still a factor of ~2.5 between versions 3 and 2.
So why is the "memcopy" version not as efficient as the "assignment" version?
Firstly, your code contains a bug. Run this:
A = [1 2 ; 3 4]
B = [5 6 ; 7 8]
priv_memcopyBtoA2!(A, B, 2)
then:
julia> A
2×2 Array{Int64,2}:
 5  2
 7  4
You need to reset jj back to 1 at the start of each inner while loop, i.e.:
function priv_memcopyBtoA2!(A::Matrix{Int}, B::Matrix{Int}, n::Int)
    ii = 1
    while ii <= n
        jj = 1
        while jj <= n
            A[jj,ii] = B[jj,ii]
            jj += 1
        end
        ii += 1
    end
    return nothing
end
Even with the bug fix, you'll still note that the while loop solution is faster. This is because array slices in Julia create temporary arrays. So in this line:

    A[1:n,1:n] = B[1:n,1:n]

the right-hand side operation creates a temporary n×n array, and then assigns the temporary array to the left-hand side.
If you wanted to avoid the temporary array allocation, you would instead write:

    A[1:n,1:n] = view(B, 1:n, 1:n)

and you'll notice that the timings of the two methods are now pretty close, although the while loop is still slightly faster. As a general rule, loops in Julia are fast (as in C fast), and explicitly writing out the loop will usually get you the most optimized compiled code. I would still expect the explicit loop to be faster than the view method.
As for the garbage collection stuff, that is just a result of your method of timing. Much better to use @btime from the package BenchmarkTools, which uses various tricks to avoid traps like timing garbage collection etc.
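For instance, a minimal sketch of such a benchmark, assuming A and B are set up as above (the $ interpolation keeps global-variable access out of the measurement):

    using BenchmarkTools
    @btime priv_memcopyBtoA2!($A, $B, 2000)  # runs many samples, reports the minimum
    @btime priv_memcopyBtoA3!($A, $B, 2000)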
Why is A[1:n,1:n] = view(B, 1:n, 1:n), or variants of it, slower than a set of while loops? Let's look at what A[1:n,1:n] = view(B, 1:n, 1:n) does.
view returns a SubArray, which contains a pointer to the parent B and information on how to compute the indices that should be copied. A[1:n,1:n] = ... is parsed to a call to _setindex!(...). After that, and a few calls down the call chain, the main work is done by:
# abstractarray.jl:883
# In general, we simply re-index the parent indices by the provided ones
function getindex(V::SlowSubArray{T,N}, I::Vararg{Int,N}) where {T,N}
    @_inline_meta
    @boundscheck checkbounds(V, I...)
    @inbounds r = V.parent[reindex(V, V.indexes, I)...]
    r
end

# multidimensional.jl:212
@inline function next(iter::CartesianRange{I}, state) where I<:CartesianIndex
    state, I(inc(state.I, iter.start.I, iter.stop.I))
end
@inline inc(::Tuple{}, ::Tuple{}, ::Tuple{}) = ()
@inline inc(state::Tuple{Int}, start::Tuple{Int}, stop::Tuple{Int}) = (state[1]+1,)
@inline function inc(state, start, stop)
    if state[1] < stop[1]
        return (state[1]+1,tail(state)...)
    end
    newtail = inc(tail(state), tail(start), tail(stop))
    (start[1], newtail...)
end
getindex takes a view V and an index I. We get the view from B and the index I from A. In each step, reindex computes from the view V and the index I the indices needed to get an element in B. It's called r and we return it. Finally r is written to A.
After each copy, inc increments the index I to the next element in A and tests whether we are done. Note that the code is from v0.6.3, but in master it's more or less the same.
In principle the code could be reduced to a set of while loops, yet it is more general. It works for arbitrary views of B, for arbitrary slices of the form a:b:c, and for an arbitrary number of matrix dimensions. The dimension N is 2 in our case.
Since the functions are more complex, the compiler doesn't optimize them as well. I.e., there is a recommendation that the compiler should inline them, but it doesn't do that. This shows that the functions involved are non-trivial.
For a set of loops the compiler reduces the innermost loop to three additions (one each for a pointer into A and B, and one for the loop index) and a single copy instruction.
tl;dr The internal call chain of A[1:n,1:n] = view(B, 1:n, 1:n), coupled with multiple dispatch, is non-trivial and handles the general case. This induces overhead. A set of while loops is already optimized to a special case.
Note that the performance depends on the compiler. If one looks at the one-dimensional case A[1:n] = view(B, 1:n), it's faster than a while loop because it vectorizes the code. Yet for higher dimensions N > 2 the difference grows.
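As a side note, and only as a hedged suggestion to benchmark rather than a guarantee: on Julia 0.6 and later, a dotted in-place assignment also avoids the right-hand-side temporary, because it lowers to an element-wise broadcast! into a view of A:

    A[1:n,1:n] .= view(B, 1:n, 1:n)  # in-place element-wise copy, no RHS temporary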

Julia: use of pmap with Arrays vs SharedArrays

I have been working in Julia for a few months now and I am interested in writing some of my code in parallel. I am working on a problem where I use one model to generate data for several different receivers (the data for each receiver is a vector). The data for each receiver can be computed independently, which leads me to believe I should be able to use the pmap function. My plan is to initialize the data as a 2D SharedArray (each column represents the data for one receiver) and then have pmap loop over each of the columns. However, I am finding that using SharedArrays with pmap is no faster than working in serial using map. I wrote the following dummy code to illustrate this point.
@everywhere function Dummy(icol,model,data,A,B)
    nx = 250
    nz = 250
    nh = 50
    for ih = 1:nh
        for ix = 1:nx
            for iz = 1:nz
                data[iz,icol] += A[iz,ix,ih]*B[iz,ix,ih]*model[iz,ix,ih]
            end
        end
    end
end
function main()
    nx = 250
    nz = 250
    nh = 50
    nt = 500
    ncol = 100
    model1 = rand(nz,nx,nh)
    model2 = copy(model1)
    model3 = convert(SharedArray,model1)
    data1 = zeros(Float64,nt,ncol)
    data2 = SharedArray(Float64,nt,ncol)
    data3 = SharedArray(Float64,nt,ncol)
    A1 = rand(nz,nx,nh)
    A2 = copy(A1)
    A3 = convert(SharedArray,A1)
    B1 = rand(nz,nx,nh)
    B2 = copy(B1)
    B3 = convert(SharedArray,B1)
    @time map((arg)->Dummy(arg,model1,data1,A1,B1),[icol for icol = 1:ncol])
    @time pmap((arg)->Dummy(arg,model2,data2,A2,B2),[icol for icol = 1:ncol])
    @time pmap((arg)->Dummy(arg,model3,data3,A3,B3),[icol for icol = 1:ncol])
    println(data1==data2)
    println(data1==data3)
end
main()
I start the Julia session with julia -p 3 and run the script. The times for the three tests are 1.4 s, 4.7 s, and 1.6 s respectively. Using SharedArrays with pmap (1.6 s runtime) hasn't provided any improvement in speed compared with regular Arrays with map (1.4 s). I am also confused as to why the second case (data as a SharedArray, all other inputs as regular Arrays, with pmap) is so slow. What do I need to change in order to benefit from working in parallel?
Preface: yes, there actually is a solution to your issue. See code at bottom for that. But, before I get there, I'll go into some explanation.
I think the root of the problem here is memory access. First off, although I haven't rigorously investigated it, I suspect that there are a moderate number of improvements that could be made to Julia's underlying code in order to improve the way that it handles memory access in parallel processing. Nevertheless, in this case, I suspect that any underlying issues with the base code, if such actually exist, aren't so much at fault. Instead, I think it is useful to think carefully about what exactly is going on in your code and what it means vis-a-vis memory access.
A key thing to keep in mind when working in Julia is that it stores Arrays in column-major order. That is, it stores them as stacks of columns on top of each other. This generalizes to dimensions > 2 as well. See this very helpful segment of the Julia performance tips for more info. The implication is that it is fast to access one row after another within a single column. But if you need to be jumping around between columns, then you get into trouble. Yes, accessing RAM may be relatively fast, but accessing cache is much, much faster, so if your code allows a single column or so to be loaded from RAM into cache and then worked on, you'll do much better than if you need to do lots of swapping between RAM and cache. Here in your code, you're switching from column to column between your computations like nobody's business. For instance, in your pmap each process gets a different column of the shared array to work on. Then, each goes down the rows of that column and modifies the values in it. But, since they are trying to work in parallel with one another, and the whole array is too big to fit into your cache, there is lots of swapping between RAM and cache, and that really slows you down. In theory, perhaps a clever enough under-the-hood memory management system could be devised to address this, but I don't really know -- that goes beyond my pay grade. The same thing is of course happening with your accesses to your other objects.
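A minimal single-process sketch of that column-major effect (the function names are mine; exact timings will vary by machine):

    X = zeros(5_000, 5_000)

    function colwise!(X)  # inner loop walks down a column: contiguous access
        for j in 1:size(X,2), i in 1:size(X,1)
            X[i,j] += 1.0
        end
    end

    function rowwise!(X)  # inner loop jumps between columns: cache-unfriendly
        for i in 1:size(X,1), j in 1:size(X,2)
            X[i,j] += 1.0
        end
    end

    colwise!(X); rowwise!(X)  # warm up (compile) first
    @time colwise!(X)         # typically several times faster...
    @time rowwise!(X)         # ...than this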
Another thing to keep in mind in general when parallelizing is your ratio of flops (i.e. computer calculations) to read/write operations. Flops tend to parallelize well: you can have different cores, processes, etc. doing their own little computations on their own bits of data that they hold in their tiny caches. But read/write operations don't parallelize so well. There are some things that can be done to engineer hardware systems to improve on this. But in general, if you have a given computer system with, say, two cores, and you add four more cores to it, your ability to perform flops will increase three-fold, but your ability to read/write data to/from RAM won't really improve so much. (Note: this is an oversimplification; a lot depends on your system.) Nevertheless, in general, the higher your ratio of flops to read/writes, the more benefit you can expect from parallelism. In your case, your code involves a decent number of read/writes (all of those accesses to your different arrays) for a relatively small number of flops (a few multiplications and an addition). It's just something to keep in mind.
Fortunately, your case is amenable to some good speedups from parallelism if written correctly. In my experience with Julia, all of my most successful parallelism comes when I can break data up and have workers process chunks of it separately. Your case happens to be amenable to that. Below is an example of some code I wrote that does that. You can see that it gets nearly a 3x increase in speed going from one processor to three. The code is a bit crude in places, but it at least demonstrates the general idea of how something like this could be approached. I give a few comments on the code afterwards.
addprocs(3)

nx = 250;
nz = 250;
nh = 50;
nt = 250;
@everywhere ncol = 100;
model = rand(nz,nx,nh);
data = SharedArray(Float64,nt,ncol);
A = rand(nz,nx,nh);
B = rand(nz,nx,nh);

function distribute_data(X, obj_name_on_worker::Symbol, dim)
    size_per_worker = floor(Int,size(X,1) / nworkers())
    StartIdx = 1
    EndIdx = size_per_worker
    for (idx, pid) in enumerate(workers())
        if idx == nworkers()
            EndIdx = size(X,1)
        end
        println(StartIdx:EndIdx)
        if dim == 3
            @spawnat(pid, eval(Main, Expr(:(=), obj_name_on_worker, X[StartIdx:EndIdx,:,:])))
        elseif dim == 2
            @spawnat(pid, eval(Main, Expr(:(=), obj_name_on_worker, X[StartIdx:EndIdx,:])))
        end
        StartIdx = EndIdx + 1
        EndIdx = EndIdx + size_per_worker - 1
    end
end

distribute_data(model, :model, 3)
distribute_data(A, :A, 3)
distribute_data(B, :B, 3)
distribute_data(data, :data, 2)

@everywhere function Dummy(icol,model,data,A,B)
    nx = size(model, 2)
    nz = size(A,1)
    nh = size(model, 3)
    for ih = 1:nh
        for ix = 1:nx
            for iz = 1:nz
                data[iz,icol] += A[iz,ix,ih]*B[iz,ix,ih]*model[iz,ix,ih]
            end
        end
    end
end

regular_test() = map((arg)->Dummy(arg,model,data,A,B),[icol for icol = 1:ncol])

function parallel_test()
    @everywhere begin
        if myid() != 1
            map((arg)->Dummy(arg,model,data,A,B),[icol for icol = 1:ncol])
        end
    end
end

@time regular_test();   # 2.120631 seconds (307 allocations: 11.313 KB)
@time parallel_test();  # 0.918850 seconds (5.70 k allocations: 337.250 KB)

getfrom(p::Int, nm::Symbol; mod=Main) = fetch(@spawnat(p, getfield(mod, nm)))

function recombine_data(Data::Symbol)
    Results = cell(nworkers())
    for (idx, pid) in enumerate(workers())
        Results[idx] = getfrom(pid, Data)
    end
    return vcat(Results...)
end

@time P_Data = recombine_data(:data);  # 0.003132 seconds
P_Data == data  ## true
Comments
The use of the SharedArray is quite superfluous here. I just used it since it lends itself easily to being modified in place, which is how your code is originally written. This let me work more directly based on what you wrote without modifying it as much.
I didn't include the step to bring the data back in the time trial, but as you can see, it's quite a trivial amount of time in this case. In other situations, it might be less trivial, but data movement is just one of those issues that you face with parallelism.
When doing time trials in general, it is considered best practice to run the function once (in order to compile the code) and then run it again to get the times. That's what I did here; a minimal sketch of the pattern follows after these comments.
See this SO post for where I got inspiration for some of these functions that I used here.
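A minimal sketch of the warm-up-then-measure pattern mentioned above:

    parallel_test();        # first call includes JIT compilation
    @time parallel_test();  # second call measures the steady-state runtime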

Is it possible to remove the following !$OMP CRITICAL regions

I have a Fortran code that shows some very unsatisfactory performance due to some !$OMP CRITICAL regions. This question is actually about how the critical regions can be avoided and whether they can be removed altogether. In those critical regions I am updating counters and reading/writing values to an array.
i = 0
j = MAX/2
total = 0
!$OMP PARALLEL PRIVATE(x,N)
MAIN_LOOP: do
    !$OMP CRITICAL
    total = total + 1
    x = array(i)
    i = i + 1
    if (i > MAX) i = 1 ! if the counter is past the end, start from the beginning
    !$OMP END CRITICAL
    if (total > MAX_TOTAL) exit
    ! do some calculations here and get the value of the integer (N)
    ! store (N) copies of x back in the original array with some offset
    !$OMP CRITICAL
    do p = 1,N
        array(j) = x
        j = j + 1
        if (j > MAX) j = 1
    end do
    !$OMP END CRITICAL
end do MAIN_LOOP
!$OMP END PARALLEL
One simple thing that came to my mind is to eliminate the counter on total by using explicit dynamic loop scheduling.
!$OMP PARALLEL DO SCHEDULE(DYNAMIC)
MAIN_LOOP: do total = 1,MAX_TOTAL
    ! do the calculation here
end do MAIN_LOOP
!$OMP END PARALLEL DO
I was also thinking of allocating a different portion of the array to each thread and using the thread ID to do the offsetting. This time each thread will have its own counter, which will be stored in an array count_i(ID), with something of the sort
! this time the size of array is NUM_OMP_THREADS*MAX
ID = omp_get_thread_num()
x = array(ID + sum(count_i)) ! get the offset by summing up all values
count_i(ID) = count_i(ID) + 1
if (count_i(ID) > MAX) count_i(ID) = 1
This however will mess up the order and will not do the same as the original method. Moreover, some empty space will be present, since the different threads will not be able to fit the entire range 1:MAX.
I would appreciate your help and ideas.
Your use of critical sections is a bit strange here. The motivation for using critical sections must be to avoid having an entry in the array being clobbered before it can be read. Your code does accomplish this, but only accidentally, by acting as barriers. Try replacing the critical stuff with OMP barriers, and you should still get the right result and the same horrible speed.
Since you always write to the array half its length away from where you read it, you can avoid critical sections by dividing the operation into one step which reads from the first half and writes to the second half, and vice versa. (Edit: after the question was edited, this is no longer true, so the approach below won't work.)
nhalf = size(array)/2
!$omp parallel do
do i = 1, nhalf
    array(i+nhalf) = f(array(i))
end do
!$omp parallel do
do i = 1, nhalf
    array(i) = f(array(i+nhalf))
end do
Here f(x) represents whatever calculation you want to do to the array values. It doesn't have to be a function if you don't want it to. If it isn't clear, this code first loops through the entries in the first half of the array in parallel. The first task may go through i=1,1+nproc,1+2*nproc, etc. while the second task goes through i=2,2+nproc,2+2*nproc, and so on. This can be done in parallel without any locking because there is no overlap between the part of the array that is read from and written to in this loop. The second loop only starts once every task has finished the first loop, so there is no clobbering between the loops.
Unlike in your code, here there is one i per thread, so no locking is needed to update it (the loop variable is automatically private).
This assumes that you only want to make one pass through the array. Otherwise you can just loop over these two loops:
do iouter = 1, (max_total+size(array)-1)/size(array)
    nleft = max_total - (iouter-1)*size(array)
    nhalf = size(array)/2
    !$omp parallel do
    do i = 1, min(nhalf,nleft)
        array(i+nhalf) = f(array(i))
    end do
    !$omp parallel do
    do i = 1, min(nhalf,nleft-nhalf)
        array(i) = f(array(i+nhalf))
    end do
end do
Edit: Your new example is confusing. I'm not sure what it's supposed to do. Depending on the value of N, the array values may end being clobbered before they can be used. Is this intentional? It's hard to answer your question when it's not clear what you're trying to do. :/
I thought about this for a while and my feeling is that there is no good answer to this specific issue.
Indeed, your code seems, at first glance, like a good approach to the problem as stated (although I personally find the problem itself a bit strange). However, there are problems in your implementation:
What happens if for some reason one of the threads gets delayed in processing its iteration? Just imagine that the thread owning the very first index takes a while to process it (delayed by some third-party process getting in the way and taking the CPU time on the core where the thread was pinned/scheduled, for example) and is the last to finish... Then it will write its values back to array in a completely different order than the sequential algorithm would have done. Is that something you can accept in your algorithm?
Even without this sort of "extreme" delay, can you accept that the order in which the i indexes were distributed among threads is different from the order in which the j indexes are subsequently updated? If the thread owning i+1 finishes right before the one owning i, it will use index j instead of index j+n as it should have...
Again, I'm not sure I understand all the subtleties of your algorithm and how resilient it is to mis-ordering of iterations, but if ordering is something important, then the approach is wrong. In this case, I guess that a proper parallelisation could be something like this (put in a subroutine to make it compilable):
subroutine loop(array, maxi, max_iteration)
    implicit none
    integer, intent(in) :: maxi, max_iteration
    real, intent(inout) :: array(maxi)
    real :: x
    integer :: iteration, i, j, n, p
    i = 0
    j = maxi/2
    !$omp parallel do ordered private(x, n, p) schedule(static,1)
    do iteration = 1,max_iteration
        !$omp ordered
        x = array(wrap_around(i, maxi))
        !$omp end ordered
        ! do some calculations here and get the value of the integer (n)
        !$omp ordered
        do p = 1,n
            array(wrap_around(j, maxi)) = x
        end do
        !$omp end ordered
    end do
    !$omp end parallel do
contains
    integer function wrap_around(i, maxi)
        implicit none
        integer, intent(in) :: maxi
        integer, intent(inout) :: i
        i = i+1
        if (i > maxi) i = 1
        wrap_around = i
    end function wrap_around
end subroutine loop
I hope this would work. However, unless the central part of the loop where n is retrieved does some serious computation, this won't be any faster than the sequential version.

find match between random lists as fast as possible in fortran (+openmp)

I have two lists; I need to match elements in one against the other and output those elements into a new matrix (output). What is the fastest way to do this in Fortran? Brute force so far:
do i = 1,Nlistone
    do j = 1,Nlisttwo
        if (A(i) .eq. B(j)) then
            output(i) = B(j)
        end if
    end do
end do
openmp version:
!$OMP PARALLEL PRIVATE(i,j)
do i = 1,NA
do j = 1,NB
if A(i).eq.B(j) then
filtered(i) = A(j)
end if
end do
end do
!$OMP END PARALLEL DO
There are definitely better ways to do this, and sorting is not an option here, sorry (plus, the vector elements are not in any particular order). Is there a boolean argument similar to mask in Python?
It might be faster to use the intrinsic ANY function, something like this:
do i = 1,Nlistone
    if (any(B == A(i))) output(i) = A(i)
end do
but I wouldn't bet on this improving performance; I'd test both versions. You should be able to safely wrap this inside an !$OMP PARALLEL DO construct, since each element of output is only written to by one thread (as sketched below).
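A minimal, untested sketch of that wrapping, assuming the same declarations as in the question:

    !$OMP PARALLEL DO PRIVATE(i)
    do i = 1,Nlistone
        if (any(B == A(i))) output(i) = A(i)
    end do
    !$OMP END PARALLEL DO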
You can also include the condition testing in a forall or (newer) do concurrent construct, which both use the same syntax:
do concurrent (i = 1:Nlistone, any(a(i) == b))
    output(i) = a(i)
end do
This might be faster than your code listing, depending on the nature of A and B. But it still has the same time complexity of O(n^2), while sorting can achieve O(n log n). If Nlistone /= Nlisttwo, you could gain a factor equal to the ratio of these sizes by only looping through the shorter array and matching its elements against the longer one.
