Parallelising only inner for loop

Parallelising only inner for loop - openmp

I have a for loop which takes around 16 ms to execute and it is executed conditionally under another for loop for 500 times.
Serial code format is like this:
//Outer for loop
for(i=0;i<500;i++){
//read some entity
//some conditions
// some function calls
// some nested function calls
// inner for loop
for (j=0;some condition;j++){
// work on the entity read in outer for loop
}
}
I want to parallelize the inner for loop. Is it possible by making use of OpenMP to reduce the time required to execute inner for loop by 40% and hence the total time required to run the serial code?
I want overall time reduction to execute the code. Paralleizing outer for loop is not possible in my case since the code is written to read only one entity at a time to work on
it in the inner for loop.
Please help.
Thanks!

Openmp can paralellize such small tasks. i have done this once to do a 5x5 kernel filter on 30 fps video.
You should test what the best granularity is. If you divide the task in two, you have the least overhead, but you limit the paralelism. If the granularity is too high, loop ovrhead gets bigger, and you could be writing to adjacent memory locations from different cores, which ruins cache performance.
In my example above, I divided the image in scanlines, each of which was computed sequentially. This worked fine.

Related

last warp loop unrolling in Nvidia's parallel reduction tutorial problem

I ran into a problem for understanding the logic behind "the last warp loop unrolling" technique in Nvidia's parallel reduction tutorial available here.
In case of thread31 (for which tid=31), before unrolling the loop:
this thread only executes these operations:
sdata[31] += sdata[31+64]
sdata[31] += sdata[31+32]
But after the loop unrolling (as shown below):
The condition if(tid < 32) becomes true for thread31 and the warpReduce function will be executed for it and therefore all these operations which wouldn't be executed in the unrolled loop version will be executed now:
sdata[31] += sdata[31+32] //for second time
sdata[31] += sdata[31+16]
...
sdata[31] += sdata[31+1]
What's the logic behind it?

First:
sdata[31] += sdata[31+32] //for second time
No, that's not the case, it doesn't get executed a second time. The loop terminates when the s variable is shifted right from 64 to 32, and the body of the loop is not executed for s=32. Therefore the above statement is not executed during the body of the loop, because that would imply s=32, which is excluded by the loop termination condition.
Now, on to your question. It's true there is a behavioral difference between the two cases, however the only result that matters at the end is sdata[0] and this behavioral difference does not affect the results calculation for sdata[0]. So the only thing left would be "does it matter for performance?"
I don't have an answer for you, but I doubt it would make a significant difference. In the non-warp-reduce case, at each loop iteration there is a shift-right operation on a register variable, followed by a test, followed by a predicated set of shared memory instructions. In the warp-reduce case, there is some extra shared memory load/store activity and add arithmetic, but no shift arithmetic or testing per reduction step.
With respect to the extra load/store activity, the only portion of this that matters is the portion that will reach "above" the warp range (i.e. 0-31). There is extra shared loading activity going on here. The extra store activity and extra add arithmetic is irrelevant, because constraining these operations to less than a single warp is not any better performance-wise (this point is covered in the presentation itself, "We don’t need if (tid < s) because it doesn’t save any
work"). So the only consideration here is the once-per-step "extra" read of shared memory, one additional transaction, basically, per step. Against that we have the shifting, conditional test, and predication.
I don't know which is faster, but my guess as to the "logic" would be:
The difference would be small. Shared memory pressure is unlikely to be an issue at this point in this code.
The person who wrote it either didn't consider this at all, or considered it and decided it was probably so trivial as to be not worthy of cluttering a presentation that is really focused on other things, and will be read by many people.
EDIT: Based on comments, there appears to still be some question about my claim that the behavioral difference does not affect the results calculdation for sdata[0].
First, let's acknowledge that the only item we care about at the end is sdata[0]. sdata[1] or any other "result" is irrelevant for this discussion.
Let's make an observation about which thread calculations matter, at each step. We can observe that at a given step in the final-warp reduction, the only threads that matter (i.e. that can have an effect on the final value in sdata[0]) are those that are less then the offset value:
sdata[tid] += sdata[tid + offset]; // where offset is 32, then 16, then 8, etc.
Why is this? In order to understand that, we need to understand 2 things. First, we must understand at this point that there is an expectation of warp-synchronous behavior. This is already identified in the presentation (slide 21) as a necessary precondition to convert the loop reduction to the unrolled final warp reduction. I'm not going to spend a lot of time on the definition of warp-synchronous, but it essentially means we are depending on the warp to execute in lockstep. A warp is 32 threads, and it means that when one thread is executing a particular instruction, every thread in the warp is executing that instruction, at that point in the instruction stream. Second, we need to carefully decompose the above line to understand the sequence of operations. The above line of C++ code will decompose into the following pseudo-machine-language code that the GPU is actually executing:
LD R0, sdata[tid]
LD R1, sdata[tid+offset]
ADD R3, R2, R1
ST sdata[tid], R3
In english, at each step in the final warp unrolled reduction, each thread will load its sdata[tid] value, then each thread will load its sdata[tid+offset] value, then each thread will add those 2 values together, then each thread will store the result. Because the warp is executing in lockstep at this point, when each thread loads its sdata[tid] value, it means that every thread is loading its respective value, at that instruction cycle/clock cycle, i.e. at that instant.
now, lets revisit the overall operation. At the point in the sequence where we have:
sdata[tid] += sdata[tid + 16];
how can we justify the statement that the only threads here that matter are those whose tid value is less than the offset? The first thing each thread does is load sdata[tid]. Then each thread loads sdata[tid+16]. So at this point, threads 0-15 have loaded their own value, plus the values from locations 16-31. Threads 16-31 have loaded their own value, plus the values from locations 32-47. Then all 32 threads perform the addition, then all 32 threads perform the store operation. So thread 16, which also picked up the value from location 32, did not update the location 16 value until after the previous value at location 16 had been consumed (by thread 0 in this case). So the behavior of threads 16-31 at this point have no impact on the value computed for thread 0.
We can repeat the above process to show that for each offset, the threads whose indexes lie at or above the offset have no impact on the calculation for thread 0.

Multi-threaded multi GPU computation using openMP and openACC

I'm trying to write a code that will port openmp thread to a single gpu. I found very less case studies /codes on this.Since I`m not from computer science background.
I have less skills in programming.
This is how the basic idea look's like
And this is the code so far developed.
CALL OMP_SET_NUM_THREADS(2)
!$omp parallel num_threads(acc_get_num_devices(acc_device_nvidia))
do while ( num.gt.iteration)
id = omp_get_thread_num()
call acc_set_device_num(id+1, acc_device_nvidia)
!!$acc kernels
!error=0.0_rk
!!$omp do
!$acc kernels
!!$omp do
do j=2,nj-1
!!$acc kernels
do i=2,ni-1
T(i,j)=0.25*(T_o(i+1,j)+T_o(i-1,j)+ T_o(i,j+1)+T_o(i,j-1) )
enddo
!!$acc end kernels
enddo
!!$omp end do
!$acc end kernels
!!$acc update host(T,T_o)
error=0.0_rk
do j=2,nj-1
do i=2,ni-1
error = max( abs(T(i,j) - T_o(i,j)), error)
T_o(i,j) = T(i,j)
enddo
enddo
!!$acc end kernels
!!$acc update host(T,T_o,error)
iteration = iteration+1
print*,iteration , error
!print*,id
enddo
!$omp end parallel

There's a number of issues here.
First, you can't put an OpenMP (or OpenACC) parallel loop on a do while. Do while have indeterminant number to iterations therefor create a dependency in that exiting the loop depends on the previous iteration of the loop. You need to use a DO loop where the number of iterations is known upon entry into the loop.
Second, even if you convert this to a DO loop, you'd get a race condition if run in parallel. Each OpenMP thread would be assigning values to the same elements of the T and T_o arrays. Plus the results of T_o is used as input to the next iteration creating a dependency. In other words, you'd get wrong answers if you tried to parallelize the outer iteration loop.
For the OpenACC code, I'd suggest adding a data region around the iteration loop, i.e. "!$acc data copy(T,T_o) " before the iteration loop and then after the loop "!$acc end data", so that the data is created on the device only once. As you have it now, the data would be implicitly created and copied each time through the iteration loop causing unnecessary data movement. Also add a kernels region around the max error reduction loop so this is offloaded as well.
In general, I prefer using MPI+OpenCC for multi-GPU programming rather than OpenMP. With MPI, the domain decomposition is inherent and you then have a one-to-one mapping of MPI rank to a device. Not that OpenMP can't work, but you then often need to manually decompose the domain. Also trying to manage multiple device memories and keep them in sync can be tricky. Plus with MPI, your code can also go across nodes rather than be limited to a single node.

What's the purpose of clocked registers in pipelined processor

Hi I'm reading an textbook that descrbes the piplelined desgin of CPU.
I don't understand why we still need clocked registers? for example, as the picture belows shows:
if we can remove all three registers, we can save 60ps, because we just need the processor to continuely execute instructions, so when
a comb logic finishes, that's when the next instruction should start to execute, why we need clock cycle to manually control the beginning of executing instructions?

You can begin to understand the need for latches by imagining that they are removed.
The secret is to realize that it takes each block 100 picoseconds to produce valid results. Before that time, the output is invalid, aka junk and not as you might think, the previous result. Remember, these a combinatorial blocks that have no memory.
Now imagine that we place new data on the inputs of Block A every 100 picoseconds.
What will the output look like? Well as soon as the new data is presented to the inputs, the outputs of that block are invalid. This means that Block B has invalid inputs and cannot begin processing data until they are valid.
Now after 100 picoseconds, Block A has valid data going out and Block B can finally begin. But no, the input to Block A changes and Block B has invalid inputs again. The only way to get a valid result through all three is to hold the inputs valid for the whole 300 picoseconds needed to get through all three blocks.
With latches, the valid results from each block are latched and do not change with changing inputs. Thus we can present new data every 100 + 20 picoseconds versus every 300 picoseconds. Or, with pipeline latches the circuit runs 2.5 times faster.

Why loop is so faster?

The following loop in fortran almost takes no time
j=0
do i=1,1000000000000000000
j=j+1
end do
print*,j
But I just don't understand, our cpu is about GHz, which means 10^9 cycle in a second, while the above loop cycle is way too much than 10^9, why it almost takes no time?
It seems that the values is not computed at compiled time. We can add outer loop, until
do m=1,1000000000
do i=1,1000000000000000000
j=j+1
end do
end do
print*,j
Now it takes a second on my computer
Edit
I am using windows, intel parallel studio 15, with no extra compilation option: simply ifort test.f90. Timing method is simple, just wait after I press Enter in command line to execute the .exe

don't know fortran, but if this would be C, the compiler could optimize the above code removing the loop altogether as the value of j can be computed at compile time.
So the above code would be reduced to
print 1000000000000000000
Your logic about cycles and instructions is flawed. Modern CPUs parallelize code on hardware level, even if the code is serial:
a cpu has more a few ALU who can compute arithmetic instructions in parallel
instructions are executed in a pipeline, so at any one point, different stages of consecutive instructions are executed in parallel.
So "max of one instruction per cycle" doesn't hold.
Also increment by one is one of the fastest instruction in the CPU.

Measuring execution time of selected loops

I want to measure the running times of selected loops in a C program so as to see what percentage of the total time for executing the program (on linux) is spent in these loops. I should be able to specify the loops for which the performance should be measured. I have tried out several tools (vtune, hpctoolkit, oprofile) in the last few days and none of them seem to do this. They all find the performance bottlenecks and just show the time for those. Thats because these tools only store the time taken that is above a threshold (~1ms). So if one loop takes lesser time than that then its execution time won't be reported.
The basic block counting feature of gprof depends on a feature in older compilers thats not supported now.
I could manually write a simple timer using gettimeofday or something like that but for some cases it won't give accurate results. For ex:
for (i = 0; i < 1000; ++i)
{
for (j = 0; j < N; ++j)
{
//do some work here
}
}
Now here I want to measure the total time spent in the inner loop and I will have to put a call to gettimeofday inside the first loop. So gettimeofday itself will get called a 1000 times which introduces its own overhead and the result will be inaccurate.

Unless you have an in circuit emulator or break-out box around your CPU, there's no such thing as timing a single-loop or single-instruction. You need to bulk up your test runs to something that takes at least several seconds each in order to reduce error due to other things going on in the CPU, OS, etc.
If you're wanting to find out exactly how much time a particular loop takes to execute, and it takes less than, say, 1 second to execute, you're going to need to artificially increase the number of iterations in order to get a number that is above the "noise floor". You can then take that number and divide it by the number of artificially inflated iterations to get a figure that represents how long one pass through your target loop will take.
If you're wanting to compare the performance of different loop styles or techniques, the same thing holds: you're going to need to increase the number of iterations or passes through your test code in order to get a measurement in which what you're interested in dominates the time slice you're measuring.
This is true whether you're measuring performance using sub-millisecond high performance counters provided by the CPU, the system date time clock, or a wall clock to measure the elapsed time of your test.
Otherwise, you're just measuring white noise.

Typically if you want to measure the time spent in the inner loop, you'll put the time get routines outside of the outer loop and then divide by the (outer) loop count. If you expect the time of the inner loop to be relatively constant for any j, that is.
Any profiling instructions incur their own overhead, but presumably the overhead will be the same regardless of where it's inserted so "it all comes out in the wash." Presumably you're looking for spots where there are considerable differences between the runtimes of two compared processes, where a pair of function calls like this won't be an issue (since you need one at the "end" too, to get the time delta) since one routine will be 2x or more costly over the other.
Most platforms offer some sort of higher resolution timer, too, although the one we use here is hidden behind an API so that the "client" code is cross-platform. I'm sure with a little looking you can turn it up. Although even here, there's little likelihood that you'll get better than 1ms accuracy, so it's preferable to run the code several times in a row and time the whole run (then divide by the loop count, natch).

I'm glad you're looking for percentage, because that's easy to get. Just get it running. If it runs quickly, put an outer loop around it so it takes a good long time. That won't affect the percentages. While it's running, get stackshots. You can do this with Ctrl-Break in gdb, or you can use pstack or lsstack. Just look to see what percentage of stackshots display the code you care about.
Suppose the loops take some fraction of time, like 0.2 (20%) and you take N=20 samples. Then the number of samples that should show them will average 20 * 0.2 = 4, and the standard deviation of the number of samples will be sqrt(20 * 0.2 * 0.8) = sqrt(3.2) = 1.8, so if you want more precision, take more samples. (I personally think precision is overrated.)

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio