I am trying to parallelize this loops, but get some error in PGI compiler, I don't understand what's wrong
#pragma acc kernels
{
#pragma acc loop independent
for (i = 0;i < k; i++)
{
for(;dt*j <= Ms[i+1].t;j++)
{
w = (j*dt - Ms[i].t)/(Ms[i+1].t-Ms[i].t);
X[j] = Ms[i].x*(1-w)+Ms[i+1].x*w;
Y[j] = Ms[i].y*(1-w)+Ms[i+1].y*w;
}
}
}
Error
85, Generating Multicore code
87, #pragma acc loop gang
89, Accelerator restriction: size of the GPU copy of Y,X is unknown
Complex loop carried dependence of Ms->t,Ms->x,X->,Ms->y,Y-> prevents parallelization
Loop carried reuse of Y->,X-> prevents parallelization
So what i can do to solve this dependence problem?
I see a few issues here. Also given the output, I'm assuming that you're compiling with "-ta=multicore,tesla" (i.e. targeting both a multicore CPU and a GPU)
First, since "j" is not initialized in the "i" loop, the starting value of "j" will depended on the ending value of "j" from the previous iteration of "i". Hence, the loops are not parallelizable. By using "loop independent", you have forced parallelization on the outer loop, but you will get differing answers from running the code sequentially. You will need to rethink your algorithm.
I would suggest making X and Y two dimensional. With the first dimension of size "k". The second dimension can be a jagged array (i.e. each having a differing size) with the size corresponding to the "Ms[i+1].t" value.
I wrote an example of using jagged arrays as part of my Chapter (#5) of the Parallel Programming with OpenACC book. See: https://github.com/rmfarber/ParallelProgrammingWithOpenACC/blob/master/Chapter05/jagged_array.c
Alternatively, you might be able to set "j=Ms[i].t" assuming "Ms[0].t" is set.
for(j=Ms[i].t;dt*j <= Ms[i+1].t;j++)
"Accelerator restriction: size of the GPU copy of Y,X is unknown"
This is telling you that the compiler can not implicitly copy the X and Y arrays on the device. In C/C++, unbounded pointers don't have sizes so the compiler can't tell how big these arrays are. Often it can derive this information from the loop trip counts, but since the loop trip count is unknown (see above), it can't in this case. To fix, you need to include a data directive on the "kernels" directive or add a data region to your code. For example:
#pragma acc kernels copyout(X[0:size], Y[0:size])
or
#pragma acc data copyout(X[0:size], Y[0:size])
{
...
#pragma acc kernels
...
}
Another thing to keep in mind is pointer aliasing. In C/C++, pointers of the same type are allowed to point at the same object. Hence, without additional information such as the "restrict" attribute, the "independent" clause, or the PGI compiler flag "-Msafeptr", the compiler must assume your pointers do point to the same object making the loop not parallelizable.
This would most likely go away by either adding loop independent to the inner loop as well or using the collapse clause to flatted the loop, applying independent to both. Might also go away if all of your arrays are passed in using restrict, but maybe not.
Related
I'm having trouble in efficiently parallelizing the next line of code:
# pragma omp for nowait
for (int i = 0; i < M; i++) {
# pragma omp atomic
centroids[points[i].cluster].points_in_cluster++;
}
This runs, I guess due to the omp for overhead, slower than this:
# pragma omp single nowait
for (int i = 0; i < M; i++) {
centroids[points[i].cluster].points_in_cluster++;
}
Is there any way to make this go faster?
Theory
While atomics are certainly better than locks or critical regions due to their implementation in hardware on most platforms, they are still in general to be avoided if possible as they do not scale well, i.e. increasing the number of threads will create more atomic collisions and therefore more overhead. Further hardware-/implementation-specific bottlenecks due to atomics are described in the comments below the question and this answer by #PeterCordes.
The alternative to atomics is a parallel reduction algorithm. Assuming that there are much more points than centroids, one can use OpenMP's reduction clause to let every thread have a private version of centroids. These private histograms will be consolidated in an implementation-defined fashion after filling them.
There is no guarantee that this technique is faster than using atomics in every possible case. It could not only depend on the size of the two index spaces, but also on the data as it determines the number of collisions when using atomics. A proper parallel reduction algorithm is in general still expected to scale better to big numbers of threads.
Practice
The problem with using reduce in your code is the Array-of-Structs (AoS) data layout. Specifying
# pragma omp for reduction(+: centroids[0:num_centroids])
will produce an error at build time, as the compiler does not know how to reduce the user-defined type of centroids. Specifying
# pragma omp for reduction(+: centroids[0:num_centroids].points_in_cluster)
does not work either as it is not a valid OpenMP array section.
One can try to use an custom reduction here, but I do not know how to combine a user-defined reduction with OpenMP array sections (see the edit at the end). Also it could be very inefficient to create all the unused variables in the centroid struct on every thread.
With a Struct-of-Array (SoA) data layout you would just have a plain integer buffer, e.g. int *points_in_clusters, which could then be used in the following way (assuming that there are num_centroids elements in centroids and now points_in_clusters):
# pragma omp for nowait reduction(+: points_in_clusters[0:num_centroids])
for (int i = 0; i < M; i++) {
points_in_clusters[points[i].cluster]++;
}
If you cannot just change the data layout, you could still use some scratch space for the OpenMP reduction and afterwards copy the results back to the centroid structs in another loop. But this additional copy operation could eat into the savings from using reduction in the first place.
Using SoA also has benefits for (auto-) vectorization (of other loops) and potentially improves cache locality for regular access patterns. AoS on the other hand can be better for cache locality when encountering random access patterns (e.g. most sorting algorithms if the comparison makes use of multiple variables from the struct).
PS: Be careful with nowait. Does the following work really not depend on the resulting points_in_cluster?
EDIT: I removed my alternative implementation using a user-defined reduction operator as it was not working. I seem to have fixed the problem, but I do not have enough confidence in this implementation (performance- and correctness-wise) to add it back into the answer. Feel free to improve upon the linked code and post another answer.
I want to apply a polynomial of small degree (2-5) to a vector of whose length can be between 50 and 3000, and do this as efficiently as possible.
Example: For example, we can take the function: (1+x^2)^3, when x>3 and 0 when x<=3.
Such a function would be executed 100k times for vectors of double elements. The size of each vector can be anything between 50 and 3000.
One idea would be to use Eigen:
Eigen::ArrayXd v;
then simply apply a functor:
v.unaryExpr([&](double x) {return x>3 ? std::pow((1+x*x), 3.00) : 0.00;});
Trying with both GCC 9 and GCC 10, I saw that this loop is not being vectorized. I did vectorize it manually, only to see that the gain is much smaller than I expected (1.5x). I also replaced the conditioning with logical AND instructions, basically executing both branches and zeroing out the result when x<=3. I presume that the gain came mostly from the lack of branch misprediction.
Some considerations
There are multiple factors at play. First of all, there are RAW dependencies in my code (using intrinsics). I am not sure how this affects the computation. I wrote my code with AVX2 so I was expecting a 4x gain. I presume that this plays a role, but I cannot be sure, as the CPU has out-of-order-processing. Another problem is that I am unsure if the performance of the loop I am trying to write is bound by the memory bandwidth.
Question
How can I determine if either the memory bandwidth or pipeline hazards are affecting the implementation of this loop? Where can I learn techniques to better vectorize this loop? Are there good tools for this in Eigenr MSVC or Linux? I am using an AMD CPU as opposed to Intel.
You can fix the GCC missed optimization with -fno-trapping-math, which should really be the default because -ftrapping-math doesn't even fully work. It auto-vectorizes just fine with that option: https://godbolt.org/z/zfKjjq.
#include <stdlib.h>
void foo(double *arr, size_t n) {
for (size_t i=0 ; i<n ; i++){
double &tmp = arr[i];
double sqrp1 = 1.0 + tmp*tmp;
tmp = tmp>3 ? sqrp1*sqrp1*sqrp1 : 0;
}
}
It's avoiding the multiplies in one side of the ternary because they could raise FP exceptions that C++ abstract machine wouldn't.
You'd hope that writing it with the cubing outside a ternary should let GCC auto-vectorize, because none of the FP math operations are conditional in the source. But it doesn't actually help: https://godbolt.org/z/c7Ms9G GCC's default -ftrapping-math still decides to branch on the input to avoid all the FP computation, potentially not raising an overflow (to infinity) exception that the C++ abstract machine would have raised. Or invalid if the input was NaN. This is the kind of thing I meant about -ftrapping-math not working. (related: How to force GCC to assume that a floating-point expression is non-negative?)
Clang also has no problem: https://godbolt.org/z/KvM9fh
I'd suggest using clang -O3 -march=native -ffp-contract=fast to get FMAs across statements when FMA is available.
(In this case, -ffp-contract=on is sufficient to contract 1.0 + tmp*tmp within that one expression, but not across statements if you need to avoid that for Kahan summation for example. The clang default is apparently -ffp-contract=off, giving separate mulpd and addpd)
Of course you'll want to avoid std::pow with a small integer exponent. Compilers might not optimize that into just 2 multiplies and instead call a full pow function.
Say that I have a construct like this:
for(int i=0;i<5000;i++){
const int upper_bound = f(i);
#pragma acc parallel loop
for(int j=0;j<upper_bound;j++){
//Do work...
}
}
Where f is a monotonically-decreasing function of i.
Since num_gangs, num_workers, and vector_length are not set, OpenACC chooses what it thinks is an appropriate scheduling.
But does it choose such a scheduling afresh each time it encounters the pragma, or only once the first time the pragma is encountered?
Looking at the output of PGI_ACC_TIME suggests that scheduling is only performed once.
The PGI compiler will choose how to decompose the work at compile-time, but will generally determine the number of gangs at runtime. Gangs are inherently scalable parallelism, so the decision on how many can be deferred until runtime. The vector length and number of workers affects how the underlying kernel gets generated, so they're generally selected at compile-time to maximize optimization opportunities. With loops like these, where the bounds aren't really known at compile-time, the compiler has to generate some extra code in the kernel to ensure exactly the correct number of iterations are performed.
According to OpenAcc 2.6 specification[1] Line 1357 and 1358:
A loop associated with a loop construct that does not have a seq clause must be written such that the loop iteration count is computable when entering the loop construct.
Which seems to be the case, so your code is valid.
However, note it is implementation defined how to distribute the work among the gangs and workers, and it may be that the PGI compiler is simply doing some simple partitioning of the iterations.
You could manually define values of gang/workers using num_gangs and num_workers, and the integer expression passed to those clauses can depend on the value of your function (See 2.5.7 and 2.5.8 on OpenACC specification).
[1] https://www.openacc.org/sites/default/files/inline-files/OpenACC.2.6.final.pdf
I'm faced with parallelizing an algorithm which in its serial implementation examines the six faces of a cube of array locations within a much larger three dimensional array. (That is, select an array element, and then define a cube or cuboid around that element 'n' elements distant in x, y, and z, bounded by the bounds of the array.
Each work unit looks something like this (Fortran pseudocode; the serial algorithm is in Fortran):
do n1=nlo,nhi
do o1=olo,ohi
if (somecondition(n1,o1) .eq. .TRUE.) then
retval =.TRUE.
RETURN
endif
end do
end do
Or C pseudocode:
for (n1=nlo,n1<=nhi,n++) {
for (o1=olo,o1<=ohi,o++) {
if(somecondition(n1,o1)!=0) {
return (bool)true;
}
}
}
There are six work units like this in the total algorithm, where the 'lo' and 'hi' values generally range between 10 and 300.
What I think would be best would be to schedule six or more threads of execution, round-robin if there aren't that many CPU cores, ideally with the loops executing in parallel, with the goal the same as the serial algorithm: somecondition() becomes True, execution among all the threads must immediately stop and a value of True set in a shared location.
What techniques exist in a Windows compiler to facilitate parallelizing tasks like this? Obviously, I need a master thread which waits on a semaphore or the completion of the worker threads, so there is a need for nesting and signaling, but my experience with OpenMP is introductory at this point.
Are there message passing mechanisms in OpenMP?
EDIT: If the highest difference between "nlo" and "nhi" or "olo" and "ohi" is eight to ten, that would imply no more than 64 to 100 iterations for this nested loop, and no more than 384 to 600 iterations for the six work units together. Based on that, is it worth parallelizing at all?
Would it be better to parallelize the loop over the array elements and leave this algorithm serial, with multiple threads running the algorithm on different array elements? I'm thinking this from your comment "The time consumption comes from the fact that every element in the array must be tested like this. The arrays commonly have between four million and twenty million elements." The design of implementing the parallelelization of the array elements is also flexible in terms of the number threads. Unless there is a reason that the array elements have to be checked in some order?
It seems that the portion that you are showing us doesn't take that long to execute so making it take less clock time by making it parallel might not be easy ... there is always some overhead to multiple threads, and if there is not much time to gain, parallel code might not be faster.
One possibility is to use OpenMP to parallelize over the 6 loops -- declare logical :: array(6), allow each loop to run to completion, and then retval = any(array). Then you can check this value and return outside the parallelized loop. Add a schedule(dynamic) to the parallel do statement if you do this. Or, have a separate !$omp parallel and then put !$omp do schedule(dynamic) ... !$omp end do nowait around each of the 6 loops.
Or, you can follow the good advice by #M.S.B. and parallelize the outermost loop over the whole array. The problem here is that you cannot have a RETURN inside a parallel loop -- so label the second outermost loop (the largest one within the parallel part), and EXIT that loop -- smth like
retval = .FALSE.
!$omp parallel do default(private) shared(BIGARRAY,retval) schedule(dynamic,1)
do k=1,NN
if(.not. retval) then
outer2: do j=1,NN
do i=1,NN
! --- your loop #1
do n1=nlo,nhi
do o1=olo,ohi
if (somecondition(BIGARRAY(i,j,k),n1,o1)) then
retval =.TRUE.
exit outer2
endif
end do
end do
! --- your loops #2 ... #6 go here
end do
end do outer2
end if
end do
!$omp end parallel do
[edit: the if statement is there presuming that you need to find out if there is at least one element like that in the big array. If you need to figure the condition for every element, you can similarly either add a dummy loop exit or goto, skipping the rest of the processing for that element. Again, use schedule(dynamic) or schedule(guided).]
As a separate point, you might also want to check if it may be a good idea to go through the innermost loop by some larger step (depending on float size), compute a vector of logicals on each iteration and then aggregate the results, eg. smth like if(count(somecondition(x(o1:o1+step,n1,k)))>0); in this case the compiler may be able to vectorize somecondition.
I believe you can do what you want with the task construct introduced in OpenMP 3; Intel Fortran supports tasking in OpenMP. I don't use tasks often so I won't offer you any wonky pseudocode.
You already mentioned the obvious way to stop all threads as soon as any thread finds the ending condition: have each check some shared variable which gives the status of the ending condition, thereby determining whether to break out of the loops. Obviously this is an overhead, so if you decide to take this approach I would suggest a few things:
Use atomics to check the ending condition, this avoids expensive memory flushing as just the variable in question is flushed. Move to OpenMP 3.1, there are some new atomic operations supported.
Check infrequently, maybe like once per outer iteration. You should only be parallelizing large cases to overcome the overhead of multithreading.
This one is optional, but you can try adding compiler hints, e.g. if you expect a certain condition to be false most of the time, the compiler will optimize the code accordingly.
Another (somewhat dirty) approach is to use shared variables for the loop ranges for each thread, maybe use a shared array where index n is for thread n. When one thread finds the ending condition, it changes the loop ranges of all the other threads so that they stop. You'll need the appropriate memory synchronization. Basically the overhead has now moved from checking a dummy variable to synchronizing/checking loop conditions. Again probably not so good to do this frequently, so maybe use shared outer loop variables and private inner loop variables.
On another note, this reminds me of the classic polling versus interrupt problem. Unfortunately I don't think OpenMP supports interrupts where you can send some kind of kill signal to each thread.
There are hacking work-arounds like using a child process for just this parallel work and invoking the operating system scheduler to emulate interrupts, however this is rather tricky to get correct and would make your code extremely unportable.
Update in response to comment:
Try something like this:
char shared_var = 0;
#pragma omp parallel
{
//you should have some method for setting loop ranges for each thread
for (n1=nlo; n1<=nhi; n1++) {
for (o1=olo; o1<=ohi; o1++) {
if (somecondition(n1,o1)!=0) {
#pragma omp atomic write
shared_var = 1; //done marker, this will also trigger the other break below
break; //could instead use goto to break out of both loops in 1 go
}
}
#pragma omp atomic read
private_var = shared_var;
if (private_var!=0) break;
}
}
A suitable parallel approach might be, to let each worker examine a part of the overall problem, exactly as in the serial case and use a local (non-shared) variable for the result (retval). Finally do a reduction over all workers on these local variables into a shared overall result.
I need to use Fortran instead of C somewhere and I am very new to Fortran. I am trying to do some big calculations but it is quite slow comparing to C (maybe 10x or more and I am using Intel's compilers for both). I think the reason is Fortran keeps the matrix in column major format, and I am trying to do operations like sum(matrix(i, j, :)), because it is column major, probably this uses the cache very inefficiently (probably not using at all). However, I am not sure if this is the actual reason (since I know so less about Fortran). Question is, the convention in Fortran is to do operations on column vectors instead of row vectors ?
(BTW: I checked the speed of Fortran already using Intel's LAPACK libraries, and it is quite fast, so it is not related to any compiler or build issue.)
Thanks.
Mete
Try changing the order of your loops when doing matrix operations, e.g. if you have something like this in C:
for (i = 0; i < M; ++i) // for each row
{
for (j = 0; j < N; ++j) // for each col
{
// matrix operations on e.g. A[i][j]
}
}
then in Fortran you want the j (column) loop as the outer loop and the i (row) loop as the inner loop.
An alternative approach, which achieves the same thing, is to keep the loops as they are but change the definition of the array, e.g. if in C it's A[x][y][z][t] then in FORTRAN make it A[t][z][y][x], assuming that t is the fastest varying loop index, and x the slowest.
Since, as you write, Fortran is column major with the first index varying fastest in memory layout, so sum(matrix(i, j, :)) causes the summation of non-contiguous locations. If this is really the cause of slower operation, then you could redefine your matrix to have a different order of dimensions so that the current 3rd dimension is the 1st. Yes, if this is your main computation, rearrange the matrix to make the summation a column operation. Explicit looping should be as earlier indices fastest, as described by #PaulR. If you had previously thought of the optimum index order for C and are changing to Fortran, this is one aspect that might need changing. But while this is theoretically true, I doubt that it really matters that much in practice, unless perhaps the array is enormous. (The worse case would be that part of the array is in RAM and part in swap on disk!) The first rule about run-time speed issues is don't guess ... measure. It is usually the algorithm.