In my book on parallel algorithms there is the following pseudo-code for the PRAM model:
procedure PrefixSumPRAM( A, n ):
BEGIN
    b := new Array(2*n - 1);
    b[1] := SumPRAM(A, n); // this will load A with the computation tree and return the sum
    for i := 1 to ( log2(n) - 1 ) do
    BEGIN
        for all procID where (2^i) <= procID <= ((2^(i+1)) - 1) do in parallel
        BEGIN
            if odd(procID) then
                b[ procID ] := b[ procID/2 ];
            else
                b[ procID ] := b[ procID/2 ] - a[ procID+1 ];
        END
    END
END
But the PRAM model specifies that all processors must execute the same instruction on different data.
So is this program executable only on a CREW PRAM model?
Or is it executable on an EREW model, where the processors with odd IDs execute
b[procID] := b[procID/2];
while the processors with even IDs execute (for example) a NOP instruction?
In the PRAM model there is an unbounded number of processors and a single shared memory. Although the processors operate in lock-step, executing one instruction per time step, each processor maintains its own state and can therefore execute the program in an arbitrary way according to its control flow.
In a CREW PRAM, two processors can read from the same memory location in the same time step, but only one processor can write to any given memory location in one step. In an EREW PRAM, concurrent reads are not allowed either.
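To see why the distinction matters for this particular pseudo-code, here is a small Julia sketch (the helper reads_per_cell is hypothetical, and it assumes the usual heap indexing from the pseudo-code: the children of node k are 2k and 2k+1) that counts how often each cell of b is read during one parallel step of the loop. Both branches read b[procID/2], so the two children of node k read the same cell b[k] in the same time step:

# Hypothetical sketch: which cells of b are read in parallel step i?
function reads_per_cell(i)
    reads = Dict{Int,Int}()
    for procID in 2^i : 2^(i+1) - 1      # the processors active in step i
        parent = div(procID, 2)          # both the odd and the even branch read b[parent]
        reads[parent] = get(reads, parent, 0) + 1
    end
    return reads
end

reads_per_cell(2)   # => Dict(2 => 2, 3 => 2): every parent cell is read twice in the same step

Two reads of the same cell in one step are allowed on a CREW PRAM; on an EREW PRAM they would have to be spread over two steps (or the parent value copied to a scratch cell first).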
I've noticed some strange behavior of Julia during a matrix copy.
Consider the following three functions:
function priv_memcopyBtoA!(A::Matrix{Int}, B::Matrix{Int}, n::Int)
    A[1:n,1:n] = B[1:n,1:n]
    return nothing
end

function priv_memcopyBtoA2!(A::Matrix{Int}, B::Matrix{Int}, n::Int)
    ii = 1; jj = 1;
    while ii <= n
        jj = 1 #(*)
        while jj <= n
            A[jj,ii] = B[jj,ii]
            jj += 1
        end
        ii += 1
    end
    return nothing
end

function priv_memcopyBtoA3!(A::Matrix{Int}, B::Matrix{Int}, n::Int)
    A[1:n,1:n] = view(B, 1:n, 1:n)
    return nothing
end
Edit: 1) I tested whether the code would throw a BoundsError, so the line marked with jj = 1 #(*) was missing in the initial code. The timing results were already taken from the fixed version, so they remain unchanged. 2) I've added the view variant; thanks to @Colin T Bowers for addressing both issues.
It seems like both functions should lead to more or less the same code. Yet I get for
A = fill!(Matrix{Int32}(2^12,2^12),2); B = Int32.(eye(2^12));
the results
@timev priv_memcopyBtoA!(A, B, 2000)
0.178327 seconds (10 allocations: 15.259 MiB, 85.52% gc time)
elapsed time (ns): 178326537
gc time (ns): 152511699
bytes allocated: 16000304
pool allocs: 9
malloc() calls: 1
GC pauses: 1
and
@timev priv_memcopyBtoA2!(A, B, 2000)
0.015760 seconds (4 allocations: 160 bytes)
elapsed time (ns): 15759742
bytes allocated: 160
pool allocs: 4
and
@timev priv_memcopyBtoA3!(A, B, 2000)
0.043771 seconds (7 allocations: 224 bytes)
elapsed time (ns): 43770978
bytes allocated: 224
pool allocs: 7
That's a drastic difference, and it's surprising. I expected the first version to behave like a memcopy, which is hard to beat for a large memory block.
The second version has overhead from the pointer arithmetic (getindex), the branch condition (<=) and the bounds check in each assignment. Yet each assignment takes just ~3 ns.
Also, the time the garbage collector consumes varies a lot for the first function. If no garbage collection is performed, the large difference shrinks, but it remains: still a factor of ~2.5 between version 3 and version 2.
So why is the "memcopy" version not as efficient as the "assignment" version?
Firstly, your code contains a bug. Run this:
A = [1 2 ; 3 4]
B = [5 6 ; 7 8]
priv_memcopyBtoA2!(A, B, 2)
then:
julia> A
2×2 Array{Int64,2}:
5 2
7 4
You need to re-assign jj back to 1 before each pass through the inner while loop, i.e.:
function priv_memcopyBtoA2!(A::Matrix{Int}, B::Matrix{Int}, n::Int)
    ii = 1
    while ii <= n
        jj = 1
        while jj <= n
            A[jj,ii] = B[jj,ii]
            jj += 1
        end
        ii += 1
    end
    return nothing
end
Even with the bug fix, you'll still note that the while loop solution is faster. This is because array slices in Julia create temporary arrays. So in this line:
A[1:n,1:n] = B[1:n,1:n]
the right-hand side creates a temporary n×n array, which is then assigned to the left-hand side.
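You can see this temporary directly with the @allocated macro (a quick sketch, using the B and n from above):

n = 2000
@allocated B[1:n, 1:n]   # roughly n*n*sizeof(eltype(B)) bytes allocated for the temporary copy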
If you wanted to avoid the temporary array allocation, you would instead write:
A[1:n,1:n] = view(B, 1:n, 1:n)
and you'll notice that the timings of the two methods are now pretty close, although the while loop is still slightly faster. As a general rule, loops in Julia are fast (as in C fast), and explicitly writing out the loop will usually get you the most optimized compiled code. I would still expect the explicit loop to be faster than the view method.
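As a side note, in-place broadcast assignment (.=) also avoids the right-hand-side temporary, since an indexed left-hand side becomes a view internally; a one-line sketch:

A[1:n, 1:n] .= view(B, 1:n, 1:n)   # broadcast assignment: neither side allocates an n×n temporary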
As for the garbage collection stuff, that is just a result of your method of timing. Much better to use @btime from the package BenchmarkTools, which uses various tricks to avoid traps like timing garbage collection etc.
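For example (a minimal sketch; it assumes BenchmarkTools is installed and uses fresh Matrix{Int} test data so that the signatures above apply):

using BenchmarkTools

A = fill(2, 2^12, 2^12);  B = fill(1, 2^12, 2^12);   # Matrix{Int} test data

# Interpolating the globals with $ keeps the cost of accessing untyped globals
# out of the measurement.
@btime priv_memcopyBtoA2!($A, $B, 2000)
@btime priv_memcopyBtoA3!($A, $B, 2000)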
Why is A[1:n,1:n] = view(B, 1:n, 1:n), or variants of it, slower than a set of while loops? Let's look at what A[1:n,1:n] = view(B, 1:n, 1:n) does.
view returns a SubArray, a lightweight object which contains a reference to the parent B and information on how to compute the indices that should be copied. A[1:n,1:n] = ... is lowered to a call to _setindex!(...). After that, a few calls down the call chain, the main work is done by:
# .\abstractarray.jl:883
# In general, we simply re-index the parent indices by the provided ones
function getindex(V::SlowSubArray{T,N}, I::Vararg{Int,N}) where {T,N}
    @_inline_meta
    @boundscheck checkbounds(V, I...)
    @inbounds r = V.parent[reindex(V, V.indexes, I)...]
    r
end

# .\multidimensional.jl:212
@inline function next(iter::CartesianRange{I}, state) where I<:CartesianIndex
    state, I(inc(state.I, iter.start.I, iter.stop.I))
end
@inline inc(::Tuple{}, ::Tuple{}, ::Tuple{}) = ()
@inline inc(state::Tuple{Int}, start::Tuple{Int}, stop::Tuple{Int}) = (state[1]+1,)
@inline function inc(state, start, stop)
    if state[1] < stop[1]
        return (state[1]+1, tail(state)...)
    end
    newtail = inc(tail(state), tail(start), tail(stop))
    (start[1], newtail...)
end
getindex takes a view V and an index I. The view comes from B and the index I from A. In each step, reindex computes from the view V and the index I the indices needed to fetch an element of B. That element is called r, and it is returned and finally written to A.
After each copy, inc increments the index I to the next element of A and tests whether the iteration is done. Note that the code is from v0.6.3, but on master it's more or less the same.
In principle the code could be reduced to a set of while loops, but it is more general: it works for arbitrary views of B, for arbitrary slices of the form a:b:c, and for an arbitrary number of array dimensions. In our case N is 2.
Since these functions are more complex, the compiler doesn't optimize them as well. For example, there is a recommendation (@inline) that the compiler should inline them, but it doesn't, which shows that the functions are not trivial.
For a set of loops, the compiler reduces the innermost loop to three additions (one each for the pointers into A and B, and one for the loop index) and a single copy instruction.
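As an illustration, a hypothetical variant of the explicit loop (call it priv_memcopyBtoA4!) can also be marked with @inbounds to drop the per-element bounds check mentioned earlier:

function priv_memcopyBtoA4!(A::Matrix{Int}, B::Matrix{Int}, n::Int)
    # same element-wise copy as the while-loop version, with bounds checks removed
    @inbounds for ii in 1:n, jj in 1:n
        A[jj, ii] = B[jj, ii]
    end
    return nothing
end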
tl;dr The internal call chain of A[1:n,1:n] = view(B, 1:n, 1:n), coupled with multiple dispatch, is non-trivial and handles the general case. This induces overhead. A set of while loops is already specialized to the particular case.
Note that the performance depends on the compiler. In the one-dimensional case, A[1:n] = view(B, 1:n) is faster than a while loop because the compiler vectorizes the code. Yet for higher dimensions N > 2 the difference grows.
Why is the time for all instructions in a sequential (non-parallel) block the same?
i.e.
module abc;
  reg [31:0] r;
  initial
  begin
    r = 0;
    $display($time, " ", r);
    r = 1;
    $display($time, " ", r);
    r = r + 2;
    $display($time, " ", r);
    $finish;
  end
endmodule
Output:
0 x
0 0
0 2
Verilog is a language designed to describe models of hardware, and test code for exercising those models, that can be run in a simulator (it was later re-purposed as a language for describing hardware to logic synthesis tools).
"time" refers not to the real world the simulator is running in but to the simulated world inside the simulator. Roughly speaking time in the simulated world only moves forward when there is nothing left to do at the current time point.
A Verilog description of hardware consists of procedural blocks. These blocks are executed in a pseudo-parallel fashion relative to each other. The code inside every block is simulated sequentially within the same time slot.
Such procedural blocks are the 'always' blocks, 'initial' blocks and 'final' blocks. You are testing an initial block. It is special and is executed, as the name suggests, at the very beginning of the simulation: all statements run sequentially, and at time '0'.
For always blocks the time will usually be non-zero, but it is still the same for all instructions in the same block.
If you want to see time differences in an initial block, you need to add delays, e.g.:
initial
begin
  r = 0;
  $display($time, " ", r);
  #1
  r = 1;
  $display($time, " ", r);
  #1
  r = r + 2;
  $display($time, " ", r);
  $finish;
end
In the above example I added two delays of one time unit each (#1). You should now see the time incrementing in your output. All statements are still executed sequentially; the delay just suspends execution of the block for one time unit.
To see parallel behavior you would need a real hardware description with always blocks, and you would need to simulate it for multiple cycles. Then you might notice that the order of prints between different always blocks varies, depending on the state of the simulation. However, even in this case the simulator will finish simulating all blocks for time 'a' before it starts simulating any blocks for a later time 'b'.
I am new to Julia and have been studying its parallel computing features recently.
After reading the relevant documentation, I am still not clear about the exact mechanism of Julia's parallelism, including the macros @sync and @async.
The following is the pmap function from the Julia v0.5 documentation:
function pmap(f, lst)
    np = nprocs()  # determine the number of processes available
    n = length(lst)
    results = Vector{Any}(n)
    i = 1
    # function to produce the next work item from the queue.
    # in this case it's just an index.
    nextidx() = (idx=i; i+=1; idx)
    @sync begin
        for p = 1:np
            if p != myid() || np == 1
                @async begin
                    while true
                        idx = nextidx()
                        if idx > n
                            break
                        end
                        results[idx] = remotecall_fetch(f, p, lst[idx])
                    end
                end
            end
        end
    end
    results
end
Is it possible for two different processors/workers to call nextidx() at the same time and get the same idx = j? If so, I feel results[j] will be computed twice and results[j+1] will not be computed at all.
Thanks very much.
More findings:
function f1()
    i = 1
    nextidx() = (idx=i; sleep(1); i+=1; idx)
    for p = 1:2
        @async begin
            idx = nextidx()
            println(idx)
        end
    end
end
f1()
The result is 1 1.
Through this I find that the time periods during which the two tasks call nextidx() can overlap. So I feel that in the first code, if np = 3 (i.e. two workers) and the length n of lst is very large, say 10^8, it's possible for the tasks to get the same index. It may happen just by a coincidence in timing, i.e. the two tasks evaluate the expression idx = i at almost the same time, so the code is not stable. Am I right?
No, so long as you schedule jobs correctly. The pmap documentation points out that the master process schedules the jobs serially; only the parallel execution is asynchronous.
pmap as coded requires no locks to ensure correct job scheduling, because the @async tasks all run on the master process and are scheduled cooperatively: a task only gives up control at a yield point such as remotecall_fetch or sleep, so the body of nextidx() runs to completion before any other task can read or update i. Adding sleep to nextidx deliberately inserts a yield point into the counter update and thereby introduces the race condition that you observe. Assuming that all of the tasks share state through nextidx -- which is the purpose of that function -- the master process will not schedule the same job twice.
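As a sketch (f2 is just a hypothetical variant of your f1 with the sleep removed), you can check that without a yield point inside nextidx the two tasks always get distinct indices:

function f2()
    i = 1
    nextidx() = (idx = i; i += 1; idx)   # no yield point inside the counter update
    @sync for p = 1:2
        @async println(nextidx())
    end
end

f2()   # prints 1 and 2, never 1 and 1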
Say you have the following code,
signal a : std_logic_vector( N - 1 downto 0 );
Or,
for i in 0 to N - 1
...
Does the N-1 part get synthesized into actual logic, or does the FPGA software (e.g. Quartus) do the math before creating the logic?
Example use case: say the generic N represents the desired number of bits. Is it better to have a second generic Nm1 which holds the result of the subtraction, or can I get by with using N-1 as above without it creating additional logic?
As N is a generic parameter, it is treated as a constant at synthesis time. Your logic synthesizer, like all logic synthesizers I know of, will propagate constants before inferring any piece of hardware. The synthesizer will compute the N-1 expression before synthesis, and this operation will not cost you a single transistor. It would be the same with a much more complex operation on constants, like, for instance, calling a function. Example:
function log2_up(x: positive) return natural is
begin
    if x = 1 then
        return 0;
    else
        return log2_up((x+1)/2) + 1;
    end if;
end function log2_up;
...
constant word_width: positive := 64;
constant write_byte_enable_width: positive := log2_up(64 / 8);
is synthesizable by all logic synthesizers I know of, and computing write_byte_enable_width does not cost a single transistor; it is computed beforehand by the synthesizer during the constant-propagation phase.
In both of these cases, the value of N - 1 has to be calculated at compile time. It isn't -- and cannot be! -- generated in logic.
Keep in mind that any HDL code ultimately has to be able to synthesize to physical hardware. Changing the number of bits in a signal, or changing the number of times some logic is instantiated, isn't something that can be done electronically.
I have a Fortran code that shows very unsatisfactory performance due to some !$OMP CRITICAL regions. This question is actually about how the critical regions can be avoided and whether they can be removed altogether. In those critical regions I am updating counters and reading/writing values to an array.
i = 0
j = MAX/2
total = 0
!$OMP PARALLEL PRIVATE(x,N)
MAIN_LOOP: do
    !$OMP CRITICAL
    total = total + 1
    x = array(i)
    i = i + 1
    if (i > MAX) i = 1   ! if the counter is past the end, start from the beginning
    !$OMP END CRITICAL
    if (total > MAX_TOTAL) exit
    ! do some calculations here and get the value of the integer (N)
    ! store (N) copies of x back in the original array with some offset
    !$OMP CRITICAL
    do p = 1, N
        array(j) = x
        j = j + 1
        if (j > MAX) j = 1
    end do
    !$OMP END CRITICAL
end do MAIN_LOOP
!$OMP END PARALLEL
One simple thing that came to my mind is to eliminate the counter on total by using explicit dynamic loop scheduling.
!$OMP PARALLEL DO SCHEDULE(DYNAMIC)
MAIN_LOOP: do total = 1, MAX_TOTAL
    ! do the calculation here
end do MAIN_LOOP
!$OMP END PARALLEL DO
I was also thinking of allocating a different portion of the array to each thread and using the thread ID for the offsetting. This time each thread will have its own counter, which will be stored in an array count_i(ID), with something of this sort:
! this time the size of array is NUM_OMP_THREADS*MAX
ID = omp_get_thread_num()
x = array(ID + sum(count_i))   ! get the offset by summing up all values
count_i(ID) = count_i(ID) + 1
if (count_i(ID) > MAX) count_i(ID) = 1
This however will mess up the order and will not do the same as the original method. Moreover, some empty space will be present, since the different threads will not be able to fill the entire range 1:MAX.
I would appreciate your help and ideas.
Your use of critical sections is a bit strange here. The motivation for using critical sections must be to avoid having an entry in the array being clobbered before it can be read. Your code does accomplish this, but only accidentally, by acting as barriers. Try replacing the critical stuff with OMP barriers, and you should still get the right result and the same horrible speed.
Since you always write to the array half its length away from where you read it, you can avoid critical sections by dividing the operation into one step which reads from the first half and writes to the second half, and vice versa. (Edit: After the question was edited, this is no longer true, so the approach below won't work.)
nhalf = size(array)/2

!$omp parallel do
do i = 1, nhalf
    array(i+nhalf) = f(array(i))
end do

!$omp parallel do
do i = 1, nhalf
    array(i) = f(array(i+nhalf))
end do
Here f(x) represents whatever calculation you want to do to the array values. It doesn't have to be a function if you don't want it to. If it isn't clear, this code first loops through the entries in the first half of the array in parallel. The first task may go through i=1,1+nproc,1+2*nproc, etc. while the second task goes through i=2,2+nproc,2+2*nproc, and so on. This can be done in parallel without any locking because there is no overlap between the part of the array that is read from and written to in this loop. The second loop only starts once every task has finished the first loop, so there is no clobbering between the loops.
Unlike in your code, here there is one i per thread, so no locking is needed to update it (the loop variable is automatically private).
This assumes that you only want to make one pass through the array. Otherwise you can just loop over these two loops:
do iouter = 1, (max_total+size(array)-1)/size(array)
    nleft = max_total - (iouter-1)*size(array)
    nhalf = size(array)/2

    !$omp parallel do
    do i = 1, min(nhalf, nleft)
        array(i+nhalf) = f(array(i))
    end do

    !$omp parallel do
    do i = 1, min(nhalf, nleft-nhalf)
        array(i) = f(array(i+nhalf))
    end do
end do
Edit: Your new example is confusing. I'm not sure what it's supposed to do. Depending on the value of N, the array values may end up being clobbered before they can be used. Is this intentional? It's hard to answer your question when it's not clear what you're trying to do. :/
I thought about this for a while and my feeling is that there is no good answer to this specific issue.
Indeed, your code seems, at first glance, like a good approach to the problem as stated (although I personally find the problem itself a bit strange). However, there are problems in your implementation:
What happens if for some reason one of the threads gets delayed in processing its iteration? Just imagine that the thread owning the very first index takes a while to process it (delayed, for example, by some third-party process getting in the way and taking the CPU time on the core where the thread was pinned/scheduled) and is the last to finish... Then it will write its values back to array in a completely different order than the sequential algorithm would have done. Is that something you can accept in your algorithm?
Even without this sort of "extreme" delay, can you accept that the order in which the i indexes were distributed among threads is different from the order in which the j indexes are subsequently updated? If the thread owning i+1 finishes right before the one owning i, it will use index j instead of index j+n as it should have...
Again, I'm not sure I understand all the subtleties of your algorithm and how resilient it is to mis-ordering of iterations, but if ordering is important, then the approach is wrong. In this case, I guess that a proper parallelisation could be something like this (put in a subroutine to make it compilable):
subroutine loop(array, maxi, max_iteration)
    implicit none
    integer, intent(in) :: maxi, max_iteration
    real, intent(inout) :: array(maxi)
    real :: x
    integer :: iteration, i, j, n, p

    i = 0
    j = maxi/2
    !$omp parallel do ordered private(x, n, p) schedule(static,1)
    do iteration = 1, max_iteration
        !$omp ordered
        x = array(wrap_around(i, maxi))
        !$omp end ordered
        ! do some calculations here and get the value of the integer (n)
        !$omp ordered
        do p = 1, n
            array(wrap_around(j, maxi)) = x
        end do
        !$omp end ordered
    end do
    !$omp end parallel do

contains

    integer function wrap_around(i, maxi)
        implicit none
        integer, intent(in) :: maxi
        integer, intent(inout) :: i
        i = i+1
        if (i > maxi) i = 1
        wrap_around = i
    end function wrap_around

end subroutine loop
I hope this works. However, unless the central part of the loop where n is computed does some serious work, this won't be any faster than the sequential version.