Zero-overhead loop in DSP - parallel-processing

Blackfin does zero overhead looping in circular buffer.How does it realize that the number of iterations has been completed without a comparison or decrementation of a counter?

From blackfin documentation:
The sequencer supports a mechanism of zero-overhead looping. The
sequencer contains two loop units, each containing three registers. Each
loop unit has a Loop Top register ( LT0 , LT1 ), a Loop Bottom register (LB0 ,
LB1), and a Loop Count register (LC0, LC1).
Two sets of zero-overhead loop registers implement loops, using hardware
counters instead of software instructions to evaluate loop conditions. After
evaluation, processing branches to a new target address. Both sets of regis-
ters include the Loop Counter ( LC), Loop Top (LT ), and Loop Bottom
(LB ) registers.
This works if Loop Counter is decremented every time that the loop is entered.
Then when the program counter reaches Loop Bottom, it is sufficient to test the value of this counter to branch
to the next instruction and leave the loop if Loop Counter == 0
to Loop Top to continue iterations if Loop Counter != 0. It is also required to decrement Loop Counter simultaneously.
This ensures zero-overhead looping.

Related

last warp loop unrolling in Nvidia's parallel reduction tutorial problem

I ran into a problem for understanding the logic behind "the last warp loop unrolling" technique in Nvidia's parallel reduction tutorial available here.
In case of thread31 (for which tid=31), before unrolling the loop:
this thread only executes these operations:
sdata[31] += sdata[31+64]
sdata[31] += sdata[31+32]
But after the loop unrolling (as shown below):
The condition if(tid < 32) becomes true for thread31 and the warpReduce function will be executed for it and therefore all these operations which wouldn't be executed in the unrolled loop version will be executed now:
sdata[31] += sdata[31+32] //for second time
sdata[31] += sdata[31+16]
...
sdata[31] += sdata[31+1]
What's the logic behind it?
First:
sdata[31] += sdata[31+32] //for second time
No, that's not the case, it doesn't get executed a second time. The loop terminates when the s variable is shifted right from 64 to 32, and the body of the loop is not executed for s=32. Therefore the above statement is not executed during the body of the loop, because that would imply s=32, which is excluded by the loop termination condition.
Now, on to your question. It's true there is a behavioral difference between the two cases, however the only result that matters at the end is sdata[0] and this behavioral difference does not affect the results calculation for sdata[0]. So the only thing left would be "does it matter for performance?"
I don't have an answer for you, but I doubt it would make a significant difference. In the non-warp-reduce case, at each loop iteration there is a shift-right operation on a register variable, followed by a test, followed by a predicated set of shared memory instructions. In the warp-reduce case, there is some extra shared memory load/store activity and add arithmetic, but no shift arithmetic or testing per reduction step.
With respect to the extra load/store activity, the only portion of this that matters is the portion that will reach "above" the warp range (i.e. 0-31). There is extra shared loading activity going on here. The extra store activity and extra add arithmetic is irrelevant, because constraining these operations to less than a single warp is not any better performance-wise (this point is covered in the presentation itself, "We don’t need if (tid < s) because it doesn’t save any
work"). So the only consideration here is the once-per-step "extra" read of shared memory, one additional transaction, basically, per step. Against that we have the shifting, conditional test, and predication.
I don't know which is faster, but my guess as to the "logic" would be:
The difference would be small. Shared memory pressure is unlikely to be an issue at this point in this code.
The person who wrote it either didn't consider this at all, or considered it and decided it was probably so trivial as to be not worthy of cluttering a presentation that is really focused on other things, and will be read by many people.
EDIT: Based on comments, there appears to still be some question about my claim that the behavioral difference does not affect the results calculdation for sdata[0].
First, let's acknowledge that the only item we care about at the end is sdata[0]. sdata[1] or any other "result" is irrelevant for this discussion.
Let's make an observation about which thread calculations matter, at each step. We can observe that at a given step in the final-warp reduction, the only threads that matter (i.e. that can have an effect on the final value in sdata[0]) are those that are less then the offset value:
sdata[tid] += sdata[tid + offset]; // where offset is 32, then 16, then 8, etc.
Why is this? In order to understand that, we need to understand 2 things. First, we must understand at this point that there is an expectation of warp-synchronous behavior. This is already identified in the presentation (slide 21) as a necessary precondition to convert the loop reduction to the unrolled final warp reduction. I'm not going to spend a lot of time on the definition of warp-synchronous, but it essentially means we are depending on the warp to execute in lockstep. A warp is 32 threads, and it means that when one thread is executing a particular instruction, every thread in the warp is executing that instruction, at that point in the instruction stream. Second, we need to carefully decompose the above line to understand the sequence of operations. The above line of C++ code will decompose into the following pseudo-machine-language code that the GPU is actually executing:
LD R0, sdata[tid]
LD R1, sdata[tid+offset]
ADD R3, R2, R1
ST sdata[tid], R3
In english, at each step in the final warp unrolled reduction, each thread will load its sdata[tid] value, then each thread will load its sdata[tid+offset] value, then each thread will add those 2 values together, then each thread will store the result. Because the warp is executing in lockstep at this point, when each thread loads its sdata[tid] value, it means that every thread is loading its respective value, at that instruction cycle/clock cycle, i.e. at that instant.
now, lets revisit the overall operation. At the point in the sequence where we have:
sdata[tid] += sdata[tid + 16];
how can we justify the statement that the only threads here that matter are those whose tid value is less than the offset? The first thing each thread does is load sdata[tid]. Then each thread loads sdata[tid+16]. So at this point, threads 0-15 have loaded their own value, plus the values from locations 16-31. Threads 16-31 have loaded their own value, plus the values from locations 32-47. Then all 32 threads perform the addition, then all 32 threads perform the store operation. So thread 16, which also picked up the value from location 32, did not update the location 16 value until after the previous value at location 16 had been consumed (by thread 0 in this case). So the behavior of threads 16-31 at this point have no impact on the value computed for thread 0.
We can repeat the above process to show that for each offset, the threads whose indexes lie at or above the offset have no impact on the calculation for thread 0.

Multi-threaded multi GPU computation using openMP and openACC

I'm trying to write a code that will port openmp thread to a single gpu. I found very less case studies /codes on this.Since I`m not from computer science background.
I have less skills in programming.
This is how the basic idea look's like
And this is the code so far developed.
CALL OMP_SET_NUM_THREADS(2)
!$omp parallel num_threads(acc_get_num_devices(acc_device_nvidia))
do while ( num.gt.iteration)
id = omp_get_thread_num()
call acc_set_device_num(id+1, acc_device_nvidia)
!!$acc kernels
!error=0.0_rk
!!$omp do
!$acc kernels
!!$omp do
do j=2,nj-1
!!$acc kernels
do i=2,ni-1
T(i,j)=0.25*(T_o(i+1,j)+T_o(i-1,j)+ T_o(i,j+1)+T_o(i,j-1) )
enddo
!!$acc end kernels
enddo
!!$omp end do
!$acc end kernels
!!$acc update host(T,T_o)
error=0.0_rk
do j=2,nj-1
do i=2,ni-1
error = max( abs(T(i,j) - T_o(i,j)), error)
T_o(i,j) = T(i,j)
enddo
enddo
!!$acc end kernels
!!$acc update host(T,T_o,error)
iteration = iteration+1
print*,iteration , error
!print*,id
enddo
!$omp end parallel
There's a number of issues here.
First, you can't put an OpenMP (or OpenACC) parallel loop on a do while. Do while have indeterminant number to iterations therefor create a dependency in that exiting the loop depends on the previous iteration of the loop. You need to use a DO loop where the number of iterations is known upon entry into the loop.
Second, even if you convert this to a DO loop, you'd get a race condition if run in parallel. Each OpenMP thread would be assigning values to the same elements of the T and T_o arrays. Plus the results of T_o is used as input to the next iteration creating a dependency. In other words, you'd get wrong answers if you tried to parallelize the outer iteration loop.
For the OpenACC code, I'd suggest adding a data region around the iteration loop, i.e. "!$acc data copy(T,T_o) " before the iteration loop and then after the loop "!$acc end data", so that the data is created on the device only once. As you have it now, the data would be implicitly created and copied each time through the iteration loop causing unnecessary data movement. Also add a kernels region around the max error reduction loop so this is offloaded as well.
In general, I prefer using MPI+OpenCC for multi-GPU programming rather than OpenMP. With MPI, the domain decomposition is inherent and you then have a one-to-one mapping of MPI rank to a device. Not that OpenMP can't work, but you then often need to manually decompose the domain. Also trying to manage multiple device memories and keep them in sync can be tricky. Plus with MPI, your code can also go across nodes rather than be limited to a single node.

Why loop is so faster?

The following loop in fortran almost takes no time
j=0
do i=1,1000000000000000000
j=j+1
end do
print*,j
But I just don't understand, our cpu is about GHz, which means 10^9 cycle in a second, while the above loop cycle is way too much than 10^9, why it almost takes no time?
It seems that the values is not computed at compiled time. We can add outer loop, until
do m=1,1000000000
do i=1,1000000000000000000
j=j+1
end do
end do
print*,j
Now it takes a second on my computer
Edit
I am using windows, intel parallel studio 15, with no extra compilation option: simply ifort test.f90. Timing method is simple, just wait after I press Enter in command line to execute the .exe
don't know fortran, but if this would be C, the compiler could optimize the above code removing the loop altogether as the value of j can be computed at compile time.
So the above code would be reduced to
print 1000000000000000000
Your logic about cycles and instructions is flawed. Modern CPUs parallelize code on hardware level, even if the code is serial:
a cpu has more a few ALU who can compute arithmetic instructions in parallel
instructions are executed in a pipeline, so at any one point, different stages of consecutive instructions are executed in parallel.
So "max of one instruction per cycle" doesn't hold.
Also increment by one is one of the fastest instruction in the CPU.

Parallelising only inner for loop

I have a for loop which takes around 16 ms to execute and it is executed conditionally under another for loop for 500 times.
Serial code format is like this:
//Outer for loop
for(i=0;i<500;i++){
//read some entity
//some conditions
// some function calls
// some nested function calls
// inner for loop
for (j=0;some condition;j++){
// work on the entity read in outer for loop
}
}
I want to parallelize the inner for loop. Is it possible by making use of OpenMP to reduce the time required to execute inner for loop by 40% and hence the total time required to run the serial code?
I want overall time reduction to execute the code. Paralleizing outer for loop is not possible in my case since the code is written to read only one entity at a time to work on
it in the inner for loop.
Please help.
Thanks!
Openmp can paralellize such small tasks. i have done this once to do a 5x5 kernel filter on 30 fps video.
You should test what the best granularity is. If you divide the task in two, you have the least overhead, but you limit the paralelism. If the granularity is too high, loop ovrhead gets bigger, and you could be writing to adjacent memory locations from different cores, which ruins cache performance.
In my example above, I divided the image in scanlines, each of which was computed sequentially. This worked fine.

How to obtain the maximum of a number in Simulink?

I am building a model which requires me to find the maximum of a set of 8 signals, also find the index of the maximum value.
How can I build such a model in Simulink(Xilinx library) ?
I am guessing Compare block followed by a counter block. But somehow, I am not able to figure all things together.
Thanks
One way which gets it all done in parallel:
You need to build a tree of comparators and multiplexers:
Start with a block which takes in two values and two indices and passes out the index and the value of the larger. One comparator, 2 muxes per block.
On the first level of your tree you have 4 of these blocks feeding into
a second level of 2 of these blocks, the results of which feed into
a final block which produces your answers
This can be pipelined so you can pour data through it as fast as you like. But it'll use a fair amount of resource. How wide are your signals? Each comparator is 1 LUT4 per bit and a 2:1 mux is 1 LUT4 per bit.
Alternatively, you use a counter to select each of your values in turn. If it's bigger than the current biggest, latch the value into your "biggest" register and latch the counter into a "biggest index" register. Reset the "biggest" register to the smallest value each time your counter resets.
This will take as many clock cycles as you have signal (8 in your case)

Resources