I read in the Kepler architecture tech brief that dynamic parallelism, newly added in CUDA 5.0, supports recursion and irregular loop structures in programming patterns.
But could anybody tell me what an irregular loop structure is?
According to this book on page 146 (which is specifically addressing implementing kernels):
A regular loop has a definite number of iterations, while the number of iterations of an irregular loop depends on certain conditions.
They also provide some examples here:
Regular loop
for (int i = 0; i < 10; i++)
{
    //...;
}
Irregular loop
while (i < 0)
{
    if (con)
    { i--; }
    else
    { /*...*/ }
    i++;
}
Another irregular loop
while (true)
{
    if (cond1)
    { break; }
    else
    {
        //...;
        if (cond2)
        { break; }
    }
}
Just to be clear, support for irregular loops within a kernel has always existed. What the brief is suggesting is that in CUDA 5.0 you can now write GPU code that more closely mimics recursive or irregularly looping algorithms by using the dynamic parallelism feature. Used correctly, this feature can allow you to implement solutions that avoid warp divergence by launching child kernels.
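For illustration, here is a minimal sketch of the idea (my own example, not from the brief; the kernel names and the per-item offset/count arrays are hypothetical). A parent kernel inspects at run time how much work each item needs and launches a right-sized child grid, instead of every thread in a warp looping a different number of times:
__global__ void childKernel(float *seg, int n)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < n)
        seg[j] *= 2.0f;  // placeholder per-element work
}

__global__ void parentKernel(float *data, const int *offset,
                             const int *count, int items)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < items && count[i] > 0) {
        // Launch a child grid sized to this item's (irregular) workload.
        childKernel<<<(count[i] + 255) / 256, 256>>>(data + offset[i], count[i]);
    }
}
Note this requires a device of compute capability 3.5 or higher and compilation with relocatable device code (nvcc -rdc=true).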
Firstly, the question might be slightly misleading; I understand the main differences between a parallel region with the collapse clause and one without. Say I want to transpose a matrix. There are the following two methods: first, a parallel for with a SIMD directive for the inner loop, and second, a method using the collapse(2) clause:
#pragma omp parallel for
for (int i = 0; i < rows; i++) {
    #pragma omp simd
    for (int j = 0; j < columns; j++) {
        *(output + j * rows + i) = *(input + i * columns + j);
    }
}
#pragma omp parallel for collapse(2)
for (int i = 0; i < rows; i++) {
    for (int j = 0; j < columns; j++) {
        *(output + j * rows + i) = *(input + i * columns + j);
    }
}
Of the two methods above, which would be more efficient, especially in terms of caching? Which implementation would be faster, and is there any way to ascertain that just by looking at the implementations?
And given that all the loop counters are independent of each other, can one set a basic guideline as to when to use which?
TIA
TL;DR: both implementations are quite inefficient. The second one will likely be slower than the first in practice, although it could theoretically scale better.
The first implementation is unlikely to be vectorized because the accesses are not contiguous in memory. Both GCC 10 and Clang 11 generate inefficient code.
The point is that OpenMP provides no high-level SIMD construct to deal with data transposition! Thus, if you want to do it efficiently, you probably need to get your hands dirty by doing it yourself (or by using an external library that does it for you).
The second implementation could be significantly slower than the first because the loop iterator is linearized, often resulting in more instructions being executed in the hot path. Some implementations (e.g. Clang 11 and ICC 19, but not GCC 10) even use a very slow modulus operation (i.e. a div instruction) to do so, resulting in a much slower loop.
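To see what the linearization means, here is roughly what the collapse(2) version amounts to (my own illustration of the transformation, not actual compiler output):
#pragma omp parallel for
for (long k = 0; k < (long)rows * columns; k++) {
    int i = (int)(k / columns);  // the div instruction some compilers emit
    int j = (int)(k % columns);
    *(output + (long)j * rows + i) = *(input + (long)i * columns + j);
}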
The second implementation should also theoretically scale better than the first because the collapse clause provides more parallelism. Indeed, in the first implementation, there are only rows iterations of the outer loop to share among n threads. So, if you work on massively parallel machines or wide rectangular matrices, with n not so small compared to rows, this could cause some work imbalance, or even thread starvation.
Why both implementations are inefficient
The two implementations are inefficient because of the memory access pattern. Indeed, on big matrices, writes to output are not contiguous and will cause many cache misses. A full cache line (64 bytes on most common architectures) is transferred even though only a few bytes of it are actually written. If columns is a power of two, cache thrashing will occur and further decrease performance.
One solution to mitigate these issues is to use tiling. Here is an example:
// Assume rows and columns are nice for sake of clarity ;)
constexpr int tileSize = 8;
assert(rows % tileSize == 0);
assert(columns % tileSize == 0);

// Note the collapse clause is needed here for scalability and
// the collapse overhead is mitigated by the inner loops.
#pragma omp parallel for collapse(2)
for (int i = 0; i < rows; i += tileSize)
{
    for (int j = 0; j < columns; j += tileSize)
    {
        for (int ti = i; ti < i + tileSize; ++ti)
        {
            for (int tj = j; tj < j + tileSize; ++tj)
            {
                output[tj * rows + ti] = input[ti * columns + tj];
            }
        }
    }
}
The above code should be faster, but it is not optimal. Successfully writing a fast transposition code is challenging. Here is some advice to improve the code:
use a temporary tile buffer to improve the memory access pattern, so the compiler can use fast SIMD instructions (see the sketch after this list)
use square tiles to improve the use of the cache
use multi-level tiling to improve the use of the L2/L3 cache or use a Z-tiling approach
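As a rough illustration of the first point (my own sketch, assuming float elements and the same nicely divisible sizes as above), the tile is staged through a small local buffer so that both the reads and the writes touch contiguous memory:
#pragma omp parallel for collapse(2)
for (int i = 0; i < rows; i += tileSize)
{
    for (int j = 0; j < columns; j += tileSize)
    {
        float tile[tileSize][tileSize];
        // Pass 1: contiguous reads from input into the buffer.
        for (int ti = 0; ti < tileSize; ++ti)
            for (int tj = 0; tj < tileSize; ++tj)
                tile[ti][tj] = input[(i + ti) * columns + (j + tj)];
        // Pass 2: contiguous writes of the transposed buffer to output.
        for (int tj = 0; tj < tileSize; ++tj)
            for (int ti = 0; ti < tileSize; ++ti)
                output[(j + tj) * rows + (i + ti)] = tile[ti][tj];
    }
}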
Alternatively, you can simply use a fast BLAS implementation; some provide quite well-optimized matrix transposition functions (not all do, but AFAIK OpenBLAS and the MKL do).
PS: I assumed matrices are stored in row-major order.
We know that while and for are the most frequently used loop constructs in many programming and scripting languages. Here I would like to ask some questions regarding the convertibility and feasibility of using a while vs. a for loop.
Is a for-to-while transformation, and vice versa, always possible? I mean, suppose one used a while loop for some functionality and I want to replace the while with a for, or vice versa: is the while ⇋ for transformation always possible (I am also interested in knowing its feasibility)? It would be helpful if I could be pointed to any research carried out on this.
I'm also interested in general guidance for using while vs. for. I also want to know whether while has some advantages over for, and vice versa.
Note: I've had this question for a long time, and I thought that, this being a great programming site, it could be useful here. I'm unsure if the question is acceptable, so I request you to consider it liberally; you can ask me to remove it if such a question hurts the quality of the site :)
I will answer using Java as a reference, though this answer should also be completely valid for C, C#, C++ and many others. If we consider the following for loop:
for (int i=0; i < 10; ++i) {
    // do something, maybe involving i
}
We can see that the loop has 3 components:
int i=0; initialization of loop counter
i < 10 criteria for loop to execute
++i increment to loop counter
The following while loop is functionally equivalent to the above for loop:
int i=0;
while (i < 10) {
    // do something, maybe involving i
    ++i;
}
We can see that the main difference between this while loop and the for loop is that, in the former, the declaration and initialization of the loop counter happen outside the loop, and we increment the loop counter inside the loop body. The check for whether the loop continues is still done inside the loop structure, as with for loops.
So a for loop can be thought of as an enhanced while loop of sorts. It frees us from having to create a loop counter outside the loop, and it lets us increment/change the loop counter within the loop structure rather than mixing such logic with the code of the loop body.
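One caveat worth adding (my own note, not part of the answer above): the mechanical rewrite breaks down around continue, because in a for loop continue still executes the update expression, while in the hand-converted while loop it jumps past the manual increment. A safe arrangement is to advance the counter before any continue:
int i=0;
while (i < 10) {
    ++i;              // advance first, so a continue below cannot skip it
    if (i % 2 == 0) {
        continue;     // with the increment at the bottom instead, this
    }                 // would loop forever once i became even
    // do something with i (note: i now runs from 1 to 10 in the body)
}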
I am trying to figure out if and how a specific existing code can be parallelized for use in an ARM Cortex-A9 NEON SIMD unit. This is the code:
for (int i = 0; i < 11; i++)
{
    f4UF1 *= F[i];

    A[i][2] = A[i][1];
    A[i][1] = A[i][0];
    A[i][0] = f4UF1;

    B[i][2] = B[i][1];
    B[i][1] = B[i][0];

    C[i]  = 0;
    C[i] += D[i][0] * A[i][0];
    C[i] += D[i][1] * A[i][1];
    C[i] += D[i][2] * A[i][2];
    C[i] -= E[i][1] * B[i][1];
    C[i] -= E[i][2] * B[i][2];

    B[i][0] = C[i] / E[i][0];
    f4UF1 = B[i][0];
}
I have looked at the code for quite a while now and I am almost sure that it cannot be parallelized efficiently, but I thought I could give it a try and ask here. I am not expecting ready code, just ideas on how to do it. Thanks :)
So yes, this does look like a biquad for which the coefficients are changed for each sample, perhaps because you are smoothing them.
As a commenter mentioned, you probably want to pre-compute the 1/E[i][0] scaling factor and perhaps roll it into the other coefficients to reduce the number of multiplies, especially on floating point platforms. You can also often normalize the biquad to get rid of the D[i][0] as well (making it 1.0), and just apply a scalar to the whole output.
And of course, you probably have realized that you want to keep everything in registers during the loop and then only write them out to memory after the loop is done... ;-)
After that, there are two vectorization techniques that I'm aware of (though I'm interested in Nils' ideas as well):
Channel vectorization - the easiest. If you need to apply filters to multiple data sets at once (very common for stereo audio, for example), you can operate on two sets of coefficients and two sets of audio data at the same time. I've found that Neon provides just about the right number of registers for two channels if you are using all SP floating point. An instant 2x speedup, really. (A rough sketch follows after the next point.)
Loop unrolling. This gets a little tricky to describe in detail here, but fortunately there is a nice page here: http://reanimator-web.appspot.com/articles/simdiir. This technique adds pole/zero pairs so as to compute more samples at once. However, the extra poles of course add extra conditions on the stability of the filter, so you have to be careful. In your case, where the coefficients seem to be dynamic, this is probably some kind of nightmare to ensure.
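To make the channel idea concrete, here is a minimal sketch (my own, with assumed names, not the poster's code): it runs a normalized Direct Form I biquad (a0 == 1) on four independent channels packed into each 128-bit register, and it omits the per-sample coefficient updates from the original question.
#include <arm_neon.h>

/* Input/output are channel-interleaved: sample i holds channels 0..3
   at in[4*i + c]. All filter state stays in registers across the loop. */
void biquad_4ch(const float *in, float *out, int n,
                float32x4_t b0, float32x4_t b1, float32x4_t b2,
                float32x4_t a1, float32x4_t a2)
{
    float32x4_t x1 = vdupq_n_f32(0.0f), x2 = vdupq_n_f32(0.0f);
    float32x4_t y1 = vdupq_n_f32(0.0f), y2 = vdupq_n_f32(0.0f);
    for (int i = 0; i < n; i++) {
        float32x4_t x = vld1q_f32(in + 4 * i); /* 4 channels of sample i */
        float32x4_t y = vmulq_f32(b0, x);      /* y  = b0*x              */
        y = vmlaq_f32(y, b1, x1);              /* y += b1 * x[i-1]       */
        y = vmlaq_f32(y, b2, x2);              /* y += b2 * x[i-2]       */
        y = vmlsq_f32(y, a1, y1);              /* y -= a1 * y[i-1]       */
        y = vmlsq_f32(y, a2, y2);              /* y -= a2 * y[i-2]       */
        x2 = x1; x1 = x;                       /* shift the delay lines  */
        y2 = y1; y1 = y;
        vst1q_f32(out + 4 * i, y);
    }
}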
I need to use Fortran instead of C somewhere and I am very new to Fortran. I am trying to do some big calculations but it is quite slow compared to C (maybe 10x or more, and I am using Intel's compilers for both). I think the reason is that Fortran keeps matrices in column-major format, and I am doing operations like sum(matrix(i, j, :)); because the storage is column-major, this probably uses the cache very inefficiently (perhaps not at all). However, I am not sure if this is the actual reason (since I know so little about Fortran). My question is: is the convention in Fortran to do operations on column vectors instead of row vectors?
(BTW: I already checked the speed of Fortran using Intel's LAPACK libraries, and it is quite fast, so it is not related to any compiler or build issue.)
Thanks.
Mete
Try changing the order of your loops when doing matrix operations, e.g. if you have something like this in C:
for (i = 0; i < M; ++i)       // for each row
{
    for (j = 0; j < N; ++j)   // for each col
    {
        // matrix operations on e.g. A[i][j]
    }
}
then in Fortran you want the j (column) loop as the outer loop and the i (row) loop as the inner loop.
An alternative approach, which achieves the same thing, is to keep the loops as they are but change the definition of the array, e.g. if in C it's A[x][y][z][t], then in Fortran make it A(t, z, y, x), assuming that t is the fastest-varying loop index and x the slowest.
As you write, Fortran is column major, with the first index varying fastest in memory layout, so sum(matrix(i, j, :)) sums non-contiguous locations. If this is really the cause of the slower operation, you could redefine your matrix with a different order of dimensions so that the current 3rd dimension becomes the 1st; if this is your main computation, rearranging the matrix makes the summation a column operation. Explicit looping should have the earlier indices varying fastest, as described by @PaulR. If you had previously worked out the optimum index order for C and are changing to Fortran, this is one aspect that might need changing.
But while this is theoretically true, I doubt it really matters that much in practice, unless perhaps the array is enormous (the worst case would be that part of the array is in RAM and part in swap on disk!). The first rule of run-time speed issues is: don't guess, measure. It is usually the algorithm.
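Since the question contrasts Fortran with C, here is the same idea expressed in C terms (my own illustration): C is row-major, the mirror image, so the contiguous direction is the last index rather than the first.
#include <stddef.h>

/* With a C array of shape (ni, nj, nk), element (i, j, k) lives at
   m[(i*nj + j)*nk + k], so k is the fastest-varying index. Summing over k
   walks contiguous memory, the analogue of summing over the FIRST index
   of a Fortran array. */
double sum_over_k(const double *m, size_t nj, size_t nk, size_t i, size_t j)
{
    const double *run = m + (i * nj + j) * nk; /* start of a contiguous run */
    double s = 0.0;
    for (size_t k = 0; k < nk; k++)
        s += run[k];
    return s;
}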
I've been trying to optimize some extremely performance-critical code (a quick sort algorithm that's being called millions and millions of times inside a monte carlo simulation) by loop unrolling. Here's the inner loop I'm trying to speed up:
// Search for elements to swap.
while(myArray[++index1] < pivot) {}
while(pivot < myArray[--index2]) {}
I tried unrolling to something like:
while (true) {
    if (myArray[++index1] >= pivot) break;
    if (myArray[++index1] >= pivot) break;
    // More unrolling
}
while (true) {
    if (pivot >= myArray[--index2]) break;
    if (pivot >= myArray[--index2]) break;
    // More unrolling
}
This made absolutely no difference so I changed it back to the more readable form. I've had similar experiences other times I've tried loop unrolling. Given the quality of branch predictors on modern hardware, when, if ever, is loop unrolling still a useful optimization?
Loop unrolling makes sense if you can break dependency chains. This gives an out-of-order or superscalar CPU the possibility to schedule things better and thus run faster.
A simple example:
for (int i = 0; i < n; i++)
{
    sum += data[i];
}
Here the dependency chain of the arguments is very short. If you get a stall because of a cache miss on the data array, the CPU cannot do anything but wait.
On the other hand this code:
for (int i = 0; i < n - 3; i += 4)  // note the n-3 bound for starting i + 0..3
{
    sum1 += data[i+0];
    sum2 += data[i+1];
    sum3 += data[i+2];
    sum4 += data[i+3];
}
sum = sum1 + sum2 + sum3 + sum4;
// if n % 4 != 0, handle the final 0..3 elements with a rolled-up loop or whatever
could run faster. If you get a cache miss or other stall in one calculation, there are still three other dependency chains that don't depend on the stall. An out-of-order CPU can execute these in parallel.
(See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for an in-depth look at how register renaming helps CPUs find that parallelism, and at the details of an FP dot product on modern x86-64 CPUs with their throughput vs. latency characteristics for pipelined floating-point SIMD FMA ALUs. Hiding the latency of FP addition or FMA is a major benefit of multiple accumulators, since FP latencies are longer than integer ones but SIMD throughput is often similar.)
Those wouldn't make any difference because you're doing the same number of comparisons. Here's a better example. Instead of:
for (int i = 0; i < 200; i++) {
    doStuff();
}
write:
for (int i = 0; i < 50; i++) {
    doStuff();
    doStuff();
    doStuff();
    doStuff();
}
Even then it almost certainly won't matter, but you are now doing 50 comparisons instead of 200 (imagine the comparison is more complex).
Manual loop unrolling in general, however, is largely an artifact of history. It's another of the growing list of things that a good compiler will do for you when it matters. For example, most people don't bother to write x <<= 1 or x += x instead of x *= 2. You just write x *= 2 and the compiler will optimize it for you to whatever is best.
Basically there's increasingly less need to second-guess your compiler.
Regardless of branch prediction on modern hardware, most compilers do loop unrolling for you anyway.
It would be worthwhile finding out how many optimizations your compiler does for you.
I found Felix von Leitner's presentation very enlightening on the subject. I recommend you read it. Summary: Modern compilers are VERY clever, so hand optimizations are almost never effective.
As far as I understand it, modern compilers already unroll loops where appropriate. An example is GCC: given the right optimisation flags (e.g. -funroll-loops), the manual says it will:
Unroll loops whose number of iterations can be determined at compile time or upon entry to the loop.
So, in practice it's likely that your compiler will do the trivial cases for you. It's up to you therefore to make sure that as many as possible of your loops are easy for the compiler to determine how many iterations will be needed.
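As a hedged illustration of that advice (my own example, not from the GCC manual): a bound that is fixed on entry to the loop is one the compiler can unroll, so hoisting a call like strlen() out of the loop condition turns an opaque trip count into a determinable one.
#include <ctype.h>
#include <string.h>

void upcase(char *s)
{
    size_t n = strlen(s);            /* bound now fixed upon entry...      */
    for (size_t i = 0; i < n; i++)   /* ...so the loop is unroll-friendly  */
        s[i] = (char)toupper((unsigned char)s[i]);
}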
Loop unrolling, whether it's hand unrolling or compiler unrolling, can often be counter-productive, particularly with more recent x86 CPUs (Core 2, Core i7). Bottom line: benchmark your code with and without loop unrolling on whatever CPUs you plan to deploy this code on.
Trying without knowing is not the way to do it.
Does this sort take a high percentage of overall time?
All loop unrolling does is reduce the loop overhead of incrementing/decrementing, comparing for the stop condition, and jumping. If what you're doing in the loop takes more instruction cycles than the loop overhead itself, you're not going to see much improvement percentage-wise.
Here's an example of how to get maximum performance.
Loop unrolling can be helpful in specific cases. The only gain isn't skipping some tests!
It can, for instance, allow scalar replacement, efficient insertion of software prefetching, and so on. You would actually be surprised how useful aggressive unrolling can be (you can easily get a 10% speedup on most loops, even with -O3).
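For example, here is a sketch of scalar replacement enabled by unrolling (my own illustration on a simple 2-point stencil, not from the answer above): after unrolling by two, each element of a is loaded from memory once and then reused from a register, instead of being loaded twice by consecutive iterations.
void stencil2(const double *a, double *b, int n)  /* assumes n >= 2 */
{
    double x0 = a[0], x1, x2;
    int i;
    for (i = 0; i + 1 < n - 1; i += 2) {
        x1 = a[i + 1];
        x2 = a[i + 2];
        b[i]     = x0 + x1;  /* a[i]   + a[i+1] */
        b[i + 1] = x1 + x2;  /* a[i+1] + a[i+2] */
        x0 = x2;             /* carry a[i+2] into the next iteration */
    }
    for (; i < n - 1; i++)   /* remainder when the trip count is odd */
        b[i] = a[i] + a[i + 1];
}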
As was said before, though, it depends a lot on the loop and on the compiler, and experimentation is necessary. It's hard to make a rule (or the compiler's heuristic for unrolling would be perfect).
Loop unrolling depends entirely on your problem size, and on your algorithm being able to reduce the work into smaller groups. What you did above does not look like that. I am not sure a Monte Carlo simulation can even be unrolled.
A good scenario for loop unrolling would be rotating an image, since there you can process separate groups of work. To get this to work you would have to reduce the number of iterations.
Loop unrolling is still useful if there are a lot of local variables both in and used by the loop, as it lets you reuse those registers instead of reserving one for the loop index.
In your example, you use a small number of local variables, not overusing the registers.
The comparison (against the loop end) is also a major drawback if it is heavy (i.e. not a simple test instruction), especially if it depends on an external function.
Loop unrolling also helps the CPU with branch prediction, but branches occur anyway.