OpenMP: Which example gives better performance?

Which one gives better performance?
Example 1
#pragma omp parallel for private(i, j)
for (i = 0; i < 100; i++) {
    for (j = 0; j < 100; j++) {
        ....do sth...
    }
}
Example 2
for (i = 0; i < 100; i++) {
    #pragma omp parallel for private(i, j)
    for (j = 0; j < 100; j++) {
        ....do sth...
    }
}
Follow up question Is it valid to use Example 3?
#pragma omp parallel for private(i)
for (i = 0; i < 100; i++) {
    #pragma omp parallel for private(j)
    for (j = 0; j < 100; j++) {
        ....do sth...
    }
}

In general, Example 1 is best because it parallelizes the outermost loop, which minimizes thread fork/join overhead. Although many OpenMP implementations pre-allocate a thread pool, there is still overhead in dispatching logical tasks to the worker threads (a.k.a. the team of threads) and joining them afterwards. Also note that if you use dynamic scheduling (e.g., schedule(dynamic, 1)), this task-dispatch overhead can become problematic.
So, Example 2 may incur significant parallel overhead, especially when the trip count of for-i is large (100 is okay, though) and the amount of work in for-j is small. "Small" is an ambiguous term and depends on many variables, but an inner loop that takes less than about a millisecond would definitely be wasteful to hand to OpenMP.
However, in the case where for-i is not parallelizable and only for-j is, Example 2 is the only option. You must then consider carefully whether the amount of parallel work can offset the parallel overhead.
Example 3 is perfectly valid as long as for-i and for-j are both safely parallelizable (i.e., there are no loop-carried flow dependences in either loop). Example 3 is called nested parallelism. You may take a look at this article. Nested parallelism should be used with care. In many OpenMP implementations you need to turn it on manually by calling omp_set_nested. However, as nested parallelism may spawn a huge number of threads, its benefit may be significantly reduced.
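As an illustration only (my sketch, not part of the original answer), Example 3 could be written as a small self-contained program that switches nested parallelism on; omp_set_nested is the traditional call, while omp_set_max_active_levels is the newer equivalent, and whether the inner level actually runs in parallel still depends on your runtime:
#include <omp.h>
#include <stdio.h>

int main(void) {
    omp_set_nested(1);             /* classic way to enable nested parallelism (deprecated in OpenMP 5.0) */
    omp_set_max_active_levels(2);  /* newer way: allow two active levels of parallelism */

    #pragma omp parallel for
    for (int i = 0; i < 100; i++) {
        #pragma omp parallel for
        for (int j = 0; j < 100; j++) {
            /* ...do sth... */
        }
    }

    printf("max active levels: %d\n", omp_get_max_active_levels());
    return 0;
}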

It depends on the amount of work you're doing in the inner loop. If it's small, launching too many threads will add overhead. If the work is big, I would probably go with option 2, depending on the number of cores your machine has.
BTW, the only place where you need to flag a variable as private is j in Example 1; in all the other cases it's implicit.
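To illustrate that point, here is a sketch (my own, not from the answer) of Example 1 with the counters declared in the loop headers; the loop variable of a parallel for is private automatically, and anything declared inside the construct is private too, so no private clause is needed at all:
#pragma omp parallel for
for (int i = 0; i < 100; i++) {      /* loop variable of the parallel for: private by default */
    for (int j = 0; j < 100; j++) {  /* declared inside the parallel region: private */
        /* ...do sth... */
    }
}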

Related

OpenMP: how to make for loop distribution depend only on array size?

Consider the following chunk of code:
int array[30000];
#pragma omp parallel
{
    for (int a = 0; a < 1000; a++)
    {
        #pragma omp for nowait
        for (int i = 0; i < 30000; i++)
        {
            /* calculations with array[i] and also other array entries happen here */
        }
    }
}
Race conditions are not a concern in my application but I would like to enforce that each thread in the parallel regions takes care of exactly the same chunk of array at each run through the inner for loop.
It is my understanding that schedule(static) distributes the for-loop items based on the number of threads and the array length. However, it is not clear whether the distribution changes for different loops or different repetitions of the same loop (even when number of threads and length are the same).
What does the standard say about this? Is schedule(static) sufficient to enforce this?
I believe this quote from OpenMP Specification provides such a guarantee:
A compliant implementation of the static schedule must ensure that the same assignment of logical iteration numbers to threads will be used in two worksharing-loop regions if the following conditions are satisfied: 1) both worksharing-loop regions have the same number of loop iterations, 2) both worksharing-loop regions have the same value of chunk_size specified, or both worksharing-loop regions have no chunk_size specified, 3) both worksharing-loop regions bind to the same parallel region, and 4) neither loop is associated with a SIMD construct.
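Concretely (my own sketch, not part of the quoted answer): adding schedule(static) to the worksharing loop in the question's code meets those conditions on every pass, since the trip count never changes, no chunk_size is given, every instance binds to the same parallel region, and no SIMD construct is involved:
int array[30000];
#pragma omp parallel
{
    for (int a = 0; a < 1000; a++)
    {
        /* identical trip count, no chunk_size, same enclosing parallel region:
           each thread is handed the same chunk of iterations every time */
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < 30000; i++)
        {
            /* calculations with array[i] and also other array entries happen here */
        }
    }
}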

How to parallelise a nested loop with cross element dependencies in cuda?

I'm a beginner at CUDA and am having some difficulties with it.
If I have an input vector A and a result vector B, both of size N, and B[i] depends on all elements of A except A[i], how can I code this without having to call a kernel multiple times inside a serial for loop? I can't think of a way to parallelise both the outer and inner loop simultaneously.
Edit: I have a device with compute capability 2.0.
example:
// a = some stuff
int i;
int j;
double result = 0;
for (i = 0; i < 1000; i++) {
    double ai = a[i];
    for (j = 0; j < 1000; j++) {
        double aj = a[j];
        if (i == j)
            continue;
        result += ai - aj;
    }
}
I have this at the moment:
// in host
int i;
for (i = 0; i < 1000; i++) {
    kernelFunc<<<2, 500>>>(i, d_a);
}
Is there a way to eliminate the serial loop?
Something like this should work, I think:
__global__ void my_diffs(const double *a, double *b, const unsigned length){
    unsigned idx = threadIdx.x + blockDim.x*blockIdx.x;
    if (idx < length){
        double my_a = a[idx];
        double result = 0.0;
        for (int j = 0; j < length; j++)
            result += my_a - a[j];
        b[idx] = result;
    }
}
(written in browser, not tested)
This can possibly be further optimized in a couple ways, however for cc 2.0 and newer devices that have L1 cache, the benefits of these optimizations might be small:
use shared memory - we can reduce the number of global loads to one per element per block. However, the initial loads will be cached in L1, and your data set is quite small (1000 double elements ?) so the benefits might be limited
create an offset indexing scheme, so each thread is using a different element from the cacheline to create coalesced access (i.e. modify j index for each thread). Again, for cc 2.0 and newer devices, this may not help much, due to L1 cache as well as the ability to broadcast warp global reads.
If you must use a cc 1.x device, then you'll get significant mileage out of one or more optimizations -- the code I've shown here will run noticeably slower in that case.
Note that I've chosen not to bother with the special case where we are subtracting a[i] from itself, as that should be approximately zero anyway, and should not disturb your results. If you're concerned about that, you can special-case it out, easily enough.
You'll also get more performance if you increase the blocks and reduce the threads per block, perhaps something like this:
my_diffs<<<8,128>>>(d_a, d_b, len);
The reason for this is that many GPUs have more than 1 or 2 SMs. To maximize perf on these GPUs with such a small data set, we want to try and get at least one block launched on each SM. Having more blocks in the grid makes this more likely.
If you want to fully parallelize the computation, the approach would be to create a 2D matrix (let's call it c[...]) in GPU memory, of square dimensions equal to the length of your vector. I would then create a 2D grid of threads, and have each thread perform the subtraction (a[row] - a[col]) and store its result in c[row*len+col]. I would then launch a second (1D) kernel to sum the columns of c (each thread has a loop to sum a column) to create the result vector b. However I'm not sure this would be any faster than the approach I've outlined. Such a "more fully parallelized" approach also wouldn't lend itself as easily to the optimizations I discussed.

set RNG state with openMP and Rcpp

I have a clarification question.
It is my understanding, that sourceCpp automatically passes on the RNG state, so that set.seed(123) gives me reproducible random numbers when calling Rcpp code. When compiling a package, I have to add a set RNG statement.
Now how does this all work with openMP either in sourceCpp or within a package?
Consider the following Rcpp code
#include <Rcpp.h>
#include <omp.h>
// [[Rcpp::depends("RcppArmadillo")]]
// [[Rcpp::plugins(openmp)]]

// [[Rcpp::export]]
Rcpp::NumericVector rnormrcpp1(int n, double mu, double sigma) {
    Rcpp::NumericVector out(n);
    for (int i = 0; i < n; i++) {
        out(i) = R::rnorm(mu, sigma);
    }
    return out;
}

// [[Rcpp::export]]
Rcpp::NumericVector rnormrcpp2(int n, double mu, double sigma, int cores = 1) {
    omp_set_num_threads(cores);
    Rcpp::NumericVector out(n);
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n; i++) {
        out(i) = R::rnorm(mu, sigma);
    }
    return out;
}
And then run
set.seed(123)
a1 = rnormrcpp1(100, 2, 3)
set.seed(123)
a2 = rnormrcpp1(100, 2, 3)
set.seed(123)
a3 = rnormrcpp2(100, 2, 3, 2)
set.seed(123)
a4 = rnormrcpp2(100, 2, 3, 2)
all.equal(a1, a2)
all.equal(a3, a4)
While a1 and a2 are identical, a3 and a4 are not. How can I adjust the RNG state within the OpenMP loop? Can I?
To expand on what Dirk Eddelbuettel has already said, it is next to impossible to both generate the same PRN sequence in parallel and have the desired speed-up. The root of this is that generation of PRN sequences is essentially a sequential process where each state depends on the previous one and this creates a backward dependence chain that reaches back as far as the initial seeding state.
There are two basic solutions to this problem. One of them requires a lot of memory and the other one requires a lot of CPU time and both are actually more like workarounds than true solutions:
pregenerated PRN sequence: One thread generates sequentially a huge array of PRNs and then all threads access this array in a manner that would be consistent with the sequential case. This method requires lots of memory in order to store the sequence. Another option would be to have the sequence stored into a disk file that is later memory-mapped. The latter method has the advantage that it saves some compute time, but generally I/O operations are slow, so it only makes sense on machines with limited processing power or with small amounts of RAM.
prewound PRNGs: This one works well in cases when work is being statically distributed among the threads, e.g. with schedule(static). Each thread has its own PRNG and all PRNGs are seeded with the same initial seed. Then each thread draws as many dummy PRNs as its starting iteration, essentially prewinding its PRNG to the correct position. For example:
thread 0: draws 0 dummy PRNs, then draws 100 PRNs and fills out(0:99)
thread 1: draws 100 dummy PRNs, then draws 100 PRNs and fills out(100:199)
thread 2: draws 200 dummy PRNs, then draws 100 PRNs and fills out(200:299)
and so on. This method works well when each thread does a lot of computation besides drawing the PRNs, since the time to prewind the PRNG can be substantial in some cases (e.g. with many iterations); a sketch of the scheme follows below.
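As an illustration only (my own sketch, not from the answer, using a local C++11 std::mt19937 per thread instead of R's RNG, and a hand-rolled static block decomposition): each thread seeds an identical engine and first discards the draws that belong to earlier threads, so the assembled output matches the purely sequential sequence:
#include <omp.h>
#include <algorithm>
#include <random>
#include <vector>

std::vector<double> rnorm_prewound(int n, double mu, double sigma, unsigned seed) {
    std::vector<double> out(n);
    #pragma omp parallel
    {
        const int nthreads = omp_get_num_threads();
        const int tid      = omp_get_thread_num();
        // the same block decomposition that schedule(static) would produce
        const int chunk = (n + nthreads - 1) / nthreads;
        const int start = std::min(n, tid * chunk);
        const int end   = std::min(n, start + chunk);

        std::mt19937 gen(seed);
        std::normal_distribution<double> dist(mu, sigma);
        for (int k = 0; k < start; k++) dist(gen);            // prewind: throw away earlier threads' draws
        for (int i = start; i < end; i++) out[i] = dist(gen); // draw this thread's own slice
    }
    return out;
}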
A third option exists for the case when there is a lot of data processing besides drawing a PRN. This one uses OpenMP ordered loops (note that the iteration chunk size is set to 1):
#pragma omp parallel for ordered schedule(static,1)
for (int i = 0; i < n; i++) {
    double rnum;
    #pragma omp ordered
    {
        rnum = R::rnorm(mu, sigma);
    }
    out(i) = lots of processing on rnum
}
Although loop ordering essentially serialises the drawing of the random numbers, it still allows the lots of processing on rnum to execute in parallel, and hence a parallel speed-up would be observed. See this answer for a better explanation of why that is.
Yes, sourceCpp() etc. set up an instantiation of RNGScope so the RNGs are left in a proper state.
And yes, one can use OpenMP. But inside an OpenMP segment you cannot control the order in which the threads execute -- so you no longer get the same sequence. I have the same problem with a package under development where I would like to have reproducible draws yet use OpenMP. But it seems you can't.

OpenMP parallelizing matrix multiplication by a triple for loop (performance issue)

I'm writing a program for matrix multiplication with OpenMP that, for cache efficiency, implements the multiplication A x B(transpose) rows by rows instead of the classic A x B rows by columns. Doing this I faced an interesting fact that to me seems illogical: if in this code I parallelize the outer loop, the program is slower than if I put the OpenMP directive on the innermost loop; on my computer the times are 10.9 vs 8.1 seconds.
// A and B are double* allocated with malloc; Nu is the length of the matrices,
// which are square
//#pragma omp parallel for
for (i = 0; i < Nu; i++) {
    for (j = 0; j < Nu; j++) {
        *(C + (i*Nu + j)) = 0.;
        #pragma omp parallel for
        for (k = 0; k < Nu; k++) {
            *(C + (i*Nu + j)) += *(A + (i*Nu + k)) * *(B + (j*Nu + k)); // C(i,j)=sum(over k) A(i,k)*B(k,j)
        }
    }
}
Try hitting the result less often. Writing to *(C+(i*Nu+j)) on every iteration induces cacheline sharing between the threads and prevents the operation from running in parallel. Using a local variable instead will allow most of the writes to take place in each core's L1 cache.
Also, use of restrict may help. Otherwise the compiler can't guarantee that writes to C aren't changing A and B.
Try:
for (int i = 0; i < Nu; i++) {
    const double* const Arow = A + i*Nu;
    double* const Crow = C + i*Nu;
    #pragma omp parallel for
    for (int j = 0; j < Nu; j++) {
        const double* const Bcol = B + j*Nu;
        double sum = 0.0;
        for (int k = 0; k < Nu; k++) {
            sum += Arow[k] * Bcol[k]; // C(i,j)=sum(over k) A(i,k)*B(k,j)
        }
        Crow[j] = sum;
    }
}
Also, I think Elalfer is right about needing reduction if you parallelize the innermost loop.
You could have some dependencies in the data when you parallelize the outer loop that the compiler is not able to rule out, so it adds additional locks.
Most probably it decides that different outer-loop iterations could write into the same (C+(i*Nu+j)) and adds access locks to protect it.
The compiler could probably figure out that there are no dependencies if you parallelize the second loop, but figuring out that there are no dependencies when parallelizing the outer loop is not so trivial for a compiler.
UPDATE
Some performance measurements.
Hi again. It looks like 1000 double multiplications and additions are not enough to cover the cost of thread synchronization.
I've done a few small tests, and simple vector scalar multiplication is not effective with OpenMP unless the number of elements exceeds roughly 10,000. Basically, the larger your array is, the more performance you will get from using OpenMP.
So, parallelizing the innermost loop, you'll have to split the task between the threads and gather the data back 1,000,000 times.
PS. Try Intel ICC; it is kind of free to use for students and open source projects. I remember it making OpenMP pay off even for arrays smaller than 10,000 elements.
UPDATE 2: Reduction example
double sum = 0.0;
int k = 0;
double *al = A + i*Nu;
double *bl = B + j*Nu;
#pragma omp parallel for shared(al, bl) reduction(+:sum)
for (k = 0; k < Nu; k++) {
    sum += al[k] * bl[k]; // C(i,j)=sum(over k) A(i,k)*B(k,j)
}
C[i*Nu + j] = sum;

OpenMP - running things in parallel and some in sequence within them

I have a scenario like:
for (i = 0; i < n; i++)
{
    for (j = 0; j < m; j++)
    {
        for (k = 0; k < x; k++)
        {
            val = 2*i + j + 4*k;
            if (val != 0)
            {
                for (t = 0; t < l; t++)
                {
                    someFunction((i + t) + someFunction(j + t) + k*t);
                }
            }
        }
    }
}
Considering this is block A, I now have two more similar blocks in my code. I want to run them in parallel, so I used OpenMP pragmas. However, I am not able to parallelize it, because I am a tad confused about which variables would be shared and private in this case. If the function call in the inner loop were an operation like sum += x, then I could have added a reduction clause.
In general, how would one approach parallelizing code with OpenMP when there is a nested for loop, and then another inner for loop doing the main operation?
I tried declaring a parallel region and then simply putting #pragma omp for before the blocks, but I am definitely missing something there!
Thanks,
Sayan
I'm more of a Fortran programmer than C so my knowledge of OpenMP in C-style is poor, and I'll leave the syntax to you.
Your easiest approach here is probably (I'll qualify this later) to simply parallelise the outermost loop. By default OpenMP will regard variable i as private and all the rest as shared. This is probably not what you want: you probably want to make j, k and t private too. I suspect that you want val private as well.
I'm a bit puzzled by the statement at the bottom of your nest of loops (ie someFunction...), which doesn't seem to return any value at all. Does it work by side-effects ?
So, you shouldn't need to declare a parallel region enclosing all this code, and you should probably only parallelise the outermost loop. If you were to parallelise the inner loops too you might find your OpenMP installation either ignoring them, spawning more processes than you have processors, or complaining bitterly.
I say that your easiest approach is probably to parallelise the outermost loop because I've made some assumptions about what your program (fragment) is doing. If the assumptions are wrong you might want to parallelise one of the inner loops. Another point to check is that the number of executions of the loop(s) you parallelise is much greater than the number of threads you use. You don't want to have OpenMP run loops with a trip count of, say, 7, on 4 threads, the load balance would be very poor.
You're correct, the innermost statement would rather be someFunction((i + t) + someFunction2(j + t) + k*t).
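For illustration only, here is a sketch of the answer's suggestion applied to block A (the bounds, val and the someFunction/someFunction2 calls are the question's placeholders, declared here just so the fragment is self-contained): only the outermost loop is parallelised, i is private automatically as the loop variable, and the inner counters and val are made private explicitly:
#include <omp.h>

int someFunction(int v);
int someFunction2(int v);

void blockA(int n, int m, int x, int l)
{
    int i, j, k, t, val;
    #pragma omp parallel for private(j, k, t, val)
    for (i = 0; i < n; i++)
    {
        for (j = 0; j < m; j++)
        {
            for (k = 0; k < x; k++)
            {
                val = 2*i + j + 4*k;
                if (val != 0)
                {
                    for (t = 0; t < l; t++)
                    {
                        someFunction((i + t) + someFunction2(j + t) + k*t);
                    }
                }
            }
        }
    }
}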
