If I have a
#pragma acc parallel loop gang num_gangs(4) \
num_workers(5) vector_length(6) private(arrayB)
{
for(j=0; j<len; j++)
{
...
}
}
region, I assume each of the 4 gangs will have a separate copy of arrayB (for this example, you can assume that arrayB is an integer array with 5 elements).
Am I right in assuming that in the above case each of the 4 gangs has a private copy of arrayB, and that the workers and vector lanes within a gang (i.e., the 5 workers of a gang, and similarly its vector lanes) share that gang's single private copy? Also, would
#pragma acc parallel loop num_gangs(4) \
num_workers(5) vector_length(6)
{
for(j=0; j<len; j++)
{
...
}
}
be the same as the one above in terms of private copies of arrayB?
Now assume,
#pragma acc parallel loop gang worker \
num_gangs(4) num_workers(5) \
vector_length(6) private(arrayB)
{
for(j=0; j<len; j++)
{
...
}
}
Then, who has a private copy of arrayB, and who shares a single private copy of arrayB? How many private copies of arrayB are there in total?
Now assume,
#pragma acc parallel loop gang vector \
num_gangs(4) num_workers(5) \
vector_length(6) private(arrayB)
{
for(j=0; j<len; j++)
{
...
}
}
Then, who has a private copy of arrayB, and who shares a single private copy of arrayB? How many private copies of arrayB are there in total?
Also, please let me know if I am missing any other possible combinations.
The "private" clause applies to the lowest schedule level (gang, worker, vector) used on the loop it is applied to.
So a "loop gang private(arr.." will have a private array for each gang that is shared among the workers and vectors within that gang.
A "loop gang worker private(arr.." will have a private array for each worker that is shared among the vectors within that worker.
A "loop gang worker vector private(arr.." will have a private array for each vector that is not shared.
For case #1, the number of private arrays created will depend on the loop schedule applied by the compiler. If you're using the PGI compiler, look at the compiler feedback messages (-Minfo=accel) to see how the loop was scheduled. If this was a typo and you meant to include a "gang" here, then the number of private arrays would equal the number of gangs.
For #2, you have a "gang worker" schedule so the number of private arrays would be the product of the number of gangs and number of workers.
For #3, you have a "gang worker vector" schedule so the number of private arrays would be the product of the number of gangs, number of workers, and vector length.
Note that in general, I don't recommend using "num_workers" or "vector_length" except for more advanced performance tuning, when the size of the inner loops is known to be smaller than the default size, or when adjusting for register usage. Otherwise, you're limiting the parallelism of the code.
I also use "num_gangs" only very infrequently. It only makes sense when you have a very large number (or size) of private arrays and limiting the number of gangs allows the private arrays to fit into the GPU's memory, or, on very rare occasions, when the number of gangs needs to be fixed for an algorithm (like for an RNG).
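To make the arithmetic concrete, here is a minimal sketch in plain C++ (not OpenACC) that simply applies the rule above to the sizes from the question. The names Level and private_copies are made up for illustration, and the numbers only hold if the compiler really uses the requested schedule and sizes.

```cpp
#include <cassert>

// Lowest parallelism level named on the loop that carries the private clause.
// (Hypothetical helper type, not part of OpenACC.)
enum class Level { Gang, Worker, Vector };

// Number of private copies under the rule stated above: one copy per
// instance of the lowest level used on the loop.
long private_copies(Level lowest, long gangs, long workers, long vector_length) {
    switch (lowest) {
        case Level::Gang:   return gangs;
        case Level::Worker: return gangs * workers;
        case Level::Vector: return gangs * workers * vector_length;
    }
    return 0;  // not reached for valid enum values
}
```

With num_gangs(4), num_workers(5) and vector_length(6) this gives 4 copies for a gang loop, 20 for a gang worker loop and 120 for a gang worker vector loop.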
Consider the following chunk of code:
int array[30000];
#pragma omp parallel
{
for( int a = 0; a < 1000; a++ )
{
#pragma omp for nowait
for( int i = 0; i < 30000; i++ )
{
/*calculations with array[i] and also other array entries happen here*/
}
}
}
Race conditions are not a concern in my application but I would like to enforce that each thread in the parallel regions takes care of exactly the same chunk of array at each run through the inner for loop.
It is my understanding that schedule(static) distributes the for-loop items based on the number of threads and the array length. However, it is not clear whether the distribution changes for different loops or different repetitions of the same loop (even when number of threads and length are the same).
What does the standard say about this? Is schedule(static) sufficient to enforce this?
I believe this quote from the OpenMP specification provides such a guarantee:
A compliant implementation of the static schedule must ensure that the same assignment of logical iteration numbers to threads will be used in two worksharing-loop regions if the following conditions are satisfied: 1) both worksharing-loop regions have the same number of loop iterations, 2) both worksharing-loop regions have the same value of chunk_size specified, or both worksharing-loop regions have no chunk_size specified, 3) both worksharing-loop regions bind to the same parallel region, and 4) neither loop is associated with a SIMD construct.
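If you want to check this on your own compiler, a small experiment along the following lines might help. It is only a sketch: the function names are invented for illustration, it records which thread ran each iteration in every repetition, and it compares the records afterwards. It keeps the implicit barrier (no nowait) so that the check itself is free of races, and it falls back to a single thread when built without OpenMP.

```cpp
#include <cassert>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Id of the calling thread, or 0 when compiled without OpenMP.
static int current_thread() {
#ifdef _OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}

// Runs `repeats` worksharing loops of `n` iterations inside ONE parallel
// region and reports whether every repetition gave each iteration to the
// same thread as the first repetition did.
bool static_assignment_is_stable(int n, int repeats) {
    // owner[a][i] = thread that executed iteration i in repetition a
    std::vector<std::vector<int>> owner(repeats, std::vector<int>(n, -1));

    #pragma omp parallel
    {
        for (int a = 0; a < repeats; ++a) {
            #pragma omp for schedule(static)
            for (int i = 0; i < n; ++i) {
                owner[a][i] = current_thread();  // each element is written exactly once
            }
        }
    }

    // Compare all repetitions against the first one (serial code).
    for (int a = 1; a < repeats; ++a) {
        if (owner[a] != owner[0]) return false;
    }
    return true;
}
```

A call such as static_assignment_is_stable(30000, 1000) mirrors the sizes in the question; a passing run is consistent with the quoted guarantee but does not replace it.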
I made a very naive implementation of the mergesort algorithm, which I ported to CUDA with very minimal changes. The algorithm code follows:
//Merge for mergesort
__device__ void merge(int* aux,int* data,int l,int m,int r)
{
int i,j,k;
for(i=m+1;i>l;i--){
aux[i-1]=data[i-1];
}
//Copy in reverse order the second subarray
for(j=m;j<r;j++){
aux[r+m-j]=data[j+1];
}
//Merge
for(k=l;k<=r;k++){
if(aux[j]<aux[i] || i==(m+1))
data[k]=aux[j--];
else
data[k]=aux[i++];
}
}
//This kernel performs a local merge
//of the array
__global__
void basic_merge(int* aux, int* data,int n)
{
int i = blockIdx.x*blockDim.x + threadIdx.x;
int tn = n / (blockDim.x*gridDim.x);
int l = i * tn;
int r = l + tn;
//printf("Thread %d: %d,%d: \n",i,l,r);
for(int i{1};i<=(tn/2)+1;i*=2)
for(int j{l+i};j<(r+1);j+=2*i)
{
merge(aux,data,j-i,j-1,j+i-1);
}
__syncthreads();
if(i==0){
//Complete the merge
do{
for(int i{tn};i<(n+1);i+=2*tn)
merge(aux,data,i-tn,i-1,i+tn-1);
tn*=2;
}while(tn<(n/2)+1);
}
}
The problem is that no matter how many threads I launch on my GTX 760, the sorting performance is always far worse than the same code running on the CPU with 8 threads (my CPU has hardware support for up to 8 concurrent threads).
For example, sorting 150 million elements takes a few hundred milliseconds on the CPU but up to 10 minutes on the GPU (even with 1024 threads per block)! Clearly I'm missing some important point here; can you please comment? I strongly suspect that the problem is in the final merge operation performed by the first thread: at that point we have a certain number of sorted subarrays (the exact number depends on the number of threads) that need to be merged, and this is done by just one thread (one tiny GPU thread).
I think I should use some kind of reduction here, so that the threads perform further merges in parallel and the "Complete the merge" step only merges the last two sorted subarrays.
I'm very new to CUDA.
EDIT (ADDENDUM):
Thanks for the link. I must admit I still need some time to learn CUDA better before I can take full advantage of that material. Anyway, I was able to rewrite the sorting function so that it uses multiple threads for as long as possible; my first implementation had a bottleneck in the last phase of the merge procedure, which was performed by a single thread.
Now, after the first merge, I use up to (1/2)*(n/b) threads each time, where n is the amount of data to sort and b is the size of the chunk of data sorted by each thread.
The improvement in performance is surprising: using only 1024 threads, it takes about 10 seconds to sort 30 million elements. Well, this is still a poor result, unfortunately! The problem is in the thread synchronization, but first things first, let's see the code:
__global__
void basic_merge(int* aux, int* data,int n)
{
int k = blockIdx.x*blockDim.x + threadIdx.x;
int b = log2( ceil( (double)n / (blockDim.x*gridDim.x)) ) + 1;
b = pow( (float)2, b);
int l=k*b;
int r=min(l+b-1,n-1);
__syncthreads();
for(int m{1};m<=(r-l);m=2*m)
{
for(int i{l};i<=r;i+=2*m)
{
merge(aux,data,i,min(r,i+m-1),min(r,i+2*m-1));
}
}
__syncthreads();
do{
if(k<=(n/b)*.5)
{
l=2*k*b;
r=min(l+2*b-1,n-1);
merge(aux,data,l,min(r,l+b-1),r);
}else break;
__syncthreads();
b*=2;
}while((r+1)<n);
}
The function 'merge' is the same as before. Now the problem is that I'm using only 1024 threads instead of the 65000 or more I can run on my CUDA device, because __syncthreads does not work as a synchronization primitive at grid level, only at block level!
So I can synchronize up to 1024 threads, which is the number of threads supported per block. Without proper synchronization each thread messes up the data of the others, and the merging procedure does not work.
In order to boost the performance I need some kind of synchronization between all the threads in the grid. It seems that no API exists for this purpose, and I read about a solution that involves multiple kernel launches from the host code, using the host as a barrier for all the threads.
I have a plan for how to implement this technique in my mergesort function, and I will provide the code in the near future. Do you have any suggestions of your own?
Thanks
It looks like all the work is being done in global memory. Each write and each read takes a long time, which makes the function slow. I think it would help to first copy your data to __shared__ memory, do the work there, and then copy the results back to global memory once the sorting is complete for that block.
A global memory access takes about 400 clock cycles (or about 100 if the data happens to be in the L2 cache). Shared memory, on the other hand, takes only 1-3 clock cycles to write and read.
The above would help performance a lot. Some other very minor things you can try are:
(1) Remove the first __syncthreads(). It is not really doing anything because no data is being passed between warps at that point.
(2) Move the "int b = log2( ceil( (double)n / (blockDim.x*gridDim.x)) ) + 1; b = pow( (float)2, b);" outside the kernel and just pass in b instead. It is being calculated by every thread when it really only needs to be calculated once.
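For point (2), a host-side helper could look roughly like the sketch below. The name chunk_size is made up, total_threads stands for blockDim.x*gridDim.x, and the loop is meant to produce the same value as the log2/pow expression (the smallest power of two strictly greater than ceil(n / total_threads)) using integers only. It assumes n and total_threads are positive and the result fits in an int.

```cpp
#include <cassert>

// Host-side replacement for the per-thread log2/pow computation of b.
// Returns the smallest power of two strictly greater than
// ceil(n / total_threads). Assumes n > 0 and total_threads > 0.
int chunk_size(int n, int total_threads) {
    // Integer ceiling division, no floating point involved.
    int per_thread = (n + total_threads - 1) / total_threads;
    int b = 1;
    while (b <= per_thread) {
        b *= 2;
    }
    return b;
}
```

The result would then be passed to the kernel as an extra argument instead of being recomputed by every thread.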
I tried to follow your algorithm but was not able to. The variable names were hard to follow... or your code is above my head. =) Hope the above helps.
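On the grid-wide synchronization raised in the addendum, the multi-launch idea can be sketched on the host side as follows. This is plain C++ with an OpenMP loop standing in for a kernel, not CUDA code, and the function names are invented: each call to merge_pass plays the role of one kernel launch, and returning from it is the barrier between passes. Every merge in a pass works on its own disjoint range, which is what makes the merges independent.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One pass: merge adjacent sorted runs of length `width` from `src` into
// `dst`. The ranges are disjoint, so the iterations are independent; on a
// GPU one iteration would be the work of one thread in one kernel launch.
void merge_pass(const std::vector<int>& src, std::vector<int>& dst, std::size_t width) {
    const long long n = static_cast<long long>(src.size());
    const long long w = static_cast<long long>(width);
    #pragma omp parallel for schedule(static)
    for (long long lo = 0; lo < n; lo += 2 * w) {
        const long long mid = std::min(lo + w, n);
        const long long hi = std::min(lo + 2 * w, n);
        std::merge(src.begin() + lo, src.begin() + mid,
                   src.begin() + mid, src.begin() + hi,
                   dst.begin() + lo);
    }
}

// Bottom-up merge sort: the run length doubles after every pass, and no
// pass starts before the previous one has completely finished.
void bottom_up_merge_sort(std::vector<int>& data) {
    std::vector<int> aux(data.size());
    for (std::size_t width = 1; width < data.size(); width *= 2) {
        merge_pass(data, aux, width);  // "kernel launch"; its return is the barrier
        data.swap(aux);                // the merged output becomes the next input
    }
}
```

Ping-ponging between two buffers avoids copying back after each pass; in a CUDA version the same swap would be done on the two device pointers between launches.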
I have a clarification question.
It is my understanding that sourceCpp automatically passes on the RNG state, so that set.seed(123) gives me reproducible random numbers when calling Rcpp code. When compiling a package, I have to add a statement that sets up the RNG state myself.
Now how does all this work with OpenMP, either in sourceCpp or within a package?
Consider the following Rcpp code
#include <Rcpp.h>
#include <omp.h>
// [[Rcpp::depends("RcppArmadillo")]]
// [[Rcpp::export]]
Rcpp::NumericVector rnormrcpp1(int n, double mu, double sigma ){
Rcpp::NumericVector out(n);
for (int i=0; i < n; i++) {
out(i) =R::rnorm(mu,sigma);
}
return(out);
}
// [[Rcpp::export]]
Rcpp::NumericVector rnormrcpp2(int n, double mu, double sigma, int cores=1 ){
omp_set_num_threads(cores);
Rcpp::NumericVector out(n);
#pragma omp parallel for schedule(dynamic)
for (int i=0; i < n; i++) {
out(i) =R::rnorm(mu,sigma);
}
return(out);
}
And then run
set.seed(123)
a1=rnormrcpp1(100,2,3)
set.seed(123)
a2=rnormrcpp1(100,2,3)
set.seed(123)
a3=rnormrcpp2(100,2,3,2)
set.seed(123)
a4=rnormrcpp2(100,2,3,2)
all.equal(a1,a2)
all.equal(a3,a4)
While a1 and a2 are identical, a3 and a4 are not. How can I adjust the RNG state with the OpenMP loop? Can I?
To expand on what Dirk Eddelbuettel has already said, it is next to impossible to both generate the same PRN sequence in parallel and have the desired speed-up. The root of this is that generation of PRN sequences is essentially a sequential process where each state depends on the previous one and this creates a backward dependence chain that reaches back as far as the initial seeding state.
There are two basic solutions to this problem. One of them requires a lot of memory and the other one requires a lot of CPU time and both are actually more like workarounds than true solutions:
pregenerated PRN sequence: One thread generates sequentially a huge array of PRNs and then all threads access this array in a manner that would be consistent with the sequential case. This method requires lots of memory in order to store the sequence. Another option would be to have the sequence stored into a disk file that is later memory-mapped. The latter method has the advantage that it saves some compute time, but generally I/O operations are slow, so it only makes sense on machines with limited processing power or with small amounts of RAM.
prewound PRNGs: This one works well in cases when work is being statically distributed among the threads, e.g. with schedule(static). Each thread has its own PRNG and all PRNGs are seeded with the same initial seed. Then each thread draws as many dummy PRNs as its starting iteration, essentially prewinding its PRNG to the correct position. For example:
thread 0: draws 0 dummy PRNs, then draws 100 PRNs and fills out(0:99)
thread 1: draws 100 dummy PRNs, then draws 100 PRNs and fills out(100:199)
thread 2: draws 200 dummy PRNs, then draws 100 PRNs and fills out(200:299)
and so on. This method works well when each thread does a lot of computations besides drawing the PRNs since the time to prewind the PRNG could be substantial in some cases (e.g. with many iterations).
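A minimal sketch of the prewinding idea follows. Note the substitutions: it uses std::mt19937 from the C++ standard library instead of R's generator, draws raw engine outputs rather than normal variates (a distribution object may consume a varying number of engine outputs per draw, which would break the position arithmetic), and splits the index range by hand so that each thread knows its starting iteration. The function names are invented for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Sequential reference: one engine, n consecutive draws.
std::vector<std::uint32_t> draw_sequential(std::uint32_t seed, int n) {
    std::mt19937 engine(seed);
    std::vector<std::uint32_t> out(n);
    for (int i = 0; i < n; ++i) out[i] = static_cast<std::uint32_t>(engine());
    return out;
}

// Parallel version: every thread seeds its own engine identically, skips
// ahead to the start of its block, then fills only that block.
std::vector<std::uint32_t> draw_prewound(std::uint32_t seed, int n) {
    std::vector<std::uint32_t> out(n);
    #pragma omp parallel
    {
        int tid = 0, nthreads = 1;  // values used when built without OpenMP
#ifdef _OPENMP
        tid = omp_get_thread_num();
        nthreads = omp_get_num_threads();
#endif
        // Contiguous block [begin, end) owned by this thread.
        const long long begin = static_cast<long long>(n) * tid / nthreads;
        const long long end = static_cast<long long>(n) * (tid + 1) / nthreads;

        std::mt19937 engine(seed);
        engine.discard(static_cast<unsigned long long>(begin));  // the "dummy draws"
        for (long long i = begin; i < end; ++i) {
            out[i] = static_cast<std::uint32_t>(engine());
        }
    }
    return out;
}
```

Both functions should return the same vector for the same seed and n, regardless of the number of threads; the cost is that every thread repeats the draws of all the threads before it.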
A third option exists for the case when there is a lot of data processing besides drawing a PRN. This one uses OpenMP ordered loops (note that the iteration chunk size is set to 1):
#pragma omp parallel for ordered schedule(static,1)
for (int i=0; i < n; i++) {
double rnum; // declared inside the loop so that each thread has its own copy
#pragma omp ordered
{
rnum = R::rnorm(mu,sigma);
}
out(i) = lots of processing on rnum
}
Although loop ordering essentially serialises the computation, it still allows the processing on rnum to execute in parallel, and hence a parallel speed-up would be observed. See this answer for a better explanation of why.
Yes, sourceCpp() etc. add an instantiation of RNGScope so the RNGs are left in a proper state.
And yes, one can do OpenMP. But inside an OpenMP segment you cannot control the order in which the threads are executed, so you no longer get the same sequence. I have the same problem with a package under development where I would like to have reproducible draws yet use OpenMP. But it seems you can't.
I am using an algorithm (implemented in C) that generates partitions of a set. (The code is here: http://www.martinbroadhurst.com/combinatorial-algorithms.html#partitions).
I was wondering if there is a way to modify this algorithm to run in parallel instead of linearly?
I've got multiple cores on my CPU and would like to split up the generation of partitions across multiple running threads.
Initialize a shared collection containing every partition of the first k elements. Each thread, until the collection is empty, repeatedly removes a partition from the collection and generates all possibilities for the remaining n - k elements using the algorithm you linked to (get another k-element prefix when incrementing the current n-element partition would change one of the first k elements).
As you can see, the algorithm you refer to creates a counter in base n and each time puts the items with the same digit into one group, thereby partitioning the input.
The counter counts from 0 up to the digit sequence (0,1,2,...,n-1), which means A = (n-1) + (n-2)*n + ... + 1*n^(n-2) + 0*n^(n-1) numbers. So you can run your algorithm on k different threads: the first thread counts from 0 to A/k, the second from (A/k)+1 to 2*A/k, and so on. That means you just add a long variable and check it against the upper bound (in your for loop condition), and also calculate A and the base-n representation of r*A/k for 0 <= r <= k.
First, consider the following variation of the serial algorithm. Take the element a, and assign it to the subset #0 (this is always valid, because the order of subsets inside a partition does not matter). The next element b might belong either to the same subset as a or to a different one, i.e. to subset #1. Then, the element c belongs to either #0 (together with a) or #1 (together with b if it's separate from a), or to its own subset (which will be #1 if #0={a,b}, or #2 if #0={a} and #1={b}). And so on. So you add new elements one by one to partially built partitions, producing a few possible outputs for each input, until all the elements have been placed. The key to parallelization is that each incomplete partition can be extended with new elements independently of, i.e. in parallel with, all other variants.
The algorithm can be implemented in different ways. I would use a recursive approach, in which a function is given a partially filled array and its current length, copies the array as many times as there are possible values for the next element (which is one more than the current largest value in the array), sets the next element to every possible value and calls itself recursively for each new array, with increased length. This approach seems particularly good for work-stealing parallel engines, such as Cilk or TBB. An implementation similar to the one suggested by @swen is also possible: you use a collection of all incomplete partitions and a pool of threads, and each thread takes one partition from the collection, produces all possible extensions and puts those back into the collection; partitions with all elements added should obviously go into a different collection.
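Here is a rough sketch of how the two ideas could be combined, under some assumptions of mine: it only counts the partitions instead of storing them, it uses OpenMP rather than Cilk or TBB, and each task extends its own private copy of the array in place instead of copying at every level. The names Prefix, collect_prefixes, count_extensions and count_partitions_parallel are invented for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// labels[i] is the subset index of element i. A new element may take any
// label from 0 up to (largest label used so far) + 1.
long long count_extensions(std::vector<int>& labels, int length, int max_label, int n) {
    if (length == n) return 1;  // complete partition; a real program would emit it here
    long long total = 0;
    for (int v = 0; v <= max_label + 1; ++v) {
        labels[length] = v;
        total += count_extensions(labels, length + 1, std::max(max_label, v), n);
    }
    return total;
}

// An incomplete partition of the first k elements.
struct Prefix {
    std::vector<int> labels;
    int max_label;
};

// Serially enumerates every partition of the first k elements.
void collect_prefixes(std::vector<int>& labels, int length, int max_label, int k,
                      std::vector<Prefix>& out) {
    if (length == k) {
        out.push_back(Prefix{labels, max_label});
        return;
    }
    for (int v = 0; v <= max_label + 1; ++v) {
        labels[length] = v;
        collect_prefixes(labels, length + 1, std::max(max_label, v), k, out);
    }
}

// Counts the partitions of an n-element set; each prefix is an independent task.
long long count_partitions_parallel(int n, int k) {
    k = std::min(k, n);
    std::vector<int> scratch(std::max(n, 1), 0);
    std::vector<Prefix> prefixes;
    collect_prefixes(scratch, 0, -1, k, prefixes);

    const int count = static_cast<int>(prefixes.size());
    long long total = 0;
    #pragma omp parallel for schedule(dynamic) reduction(+ : total)
    for (int p = 0; p < count; ++p) {
        std::vector<int> labels = prefixes[p].labels;  // private working copy
        total += count_extensions(labels, k, prefixes[p].max_label, n);
    }
    return total;
}
```

For example, count_partitions_parallel(6, 3) should give 203, the sixth Bell number. Dynamic scheduling is used because different prefixes lead to subtrees of quite different sizes.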
Here is the C++ implementation I obtained using swen's suggestion. The number of threads depends on the value of r. For r=6 the number of partitions is the sixth Bell number, which is equal to 203. For r=0 we just get a normal non-parallel program.
#include "omp.h"
#include <bits/stdc++.h>
using namespace std;
typedef long long lli;
const int MAX=10010;
const int MX=100;
int N,r=6;
int F[MAX]; // partitions first r
int Fa[MAX][MX]; // complete partitions
int P[MAX]; // first appearances first r
int Pa[MAX][MX]; // first appearances complete
int next(){// iterates to next partition of first r
for(int i=r-1;i>=0;i--){
P[F[i]]=i;
}
for(int i=r-1;i>=0;i--){
if( P[F[i]]!=i ){
F[i]++;
for(int j=i+1;j<r;j++){
F[j]=0;
}
return(1);
}
}
return(0);
}
int sig(int ID){// iterates to next partition in thread
for(int i=N-1;i>=0;i--){
Pa[ID][Fa[ID][i]]=i;
}
for(int i=N-1;i>=r;i--){
if( Pa[ID][Fa[ID][i]]!=i){
Fa[ID][i]++;
for(int j=i+1;j<N;j++){
Fa[ID][j]=0;
}
return(1);
}
}
return(0);
}
int main(){
scanf("%d",&N); // read into the global N used by sig(); a local N would shadow it
int t=1,partitions=0;
while(t || next() ){// save the current partition so we can use it for a thread later
t=0;
for(int i=0;i<r;i++){
Fa[partitions][i]=F[i];
}
partitions++;
}
// assumes N >= r and that the runtime really provides this many threads
omp_set_num_threads(partitions);
#pragma omp parallel
{
int ID = omp_get_thread_num();
int t=1;
while(t || sig(ID) ){// iterate through each partition in the thread
t=0; // reset the first-pass flag so that sig() is called and the loop can end
// the current partition in the thread is found in Fa[ID]
}
}
}
Which one gives better performance?
Example 1
#pragma omp parallel for private (i,j)
for(i = 0; i < 100; i++) {
for (j=0; j< 100; j++){
....do sth...
}
}
Example 2
for(i = 0; i < 100; i++) {
#pragma omp parallel for private (i,j)
for (j=0; j< 100; j++){
....do sth...
}
}
Follow-up question: is it valid to use Example 3?
#pragma omp parallel for private (i)
for(i = 0; i < 100; i++) {
#pragma omp parallel for private (j)
for (j=0; j< 100; j++){
....do sth...
}
}
In general, Example 1 is the best as it parallelizes the outermost loop, which minimizes thread fork/join overhead. Although many OpenMP implementations pre-allocate the thread pool, there is still overhead in dispatching logical tasks to the worker threads (a.k.a. a team of threads) and joining them. Also note that when you use dynamic scheduling (e.g., schedule(dynamic, 1)), this task dispatch overhead can become a problem.
So, Example 2 may incur significant parallel overhead, especially when the trip count of for-i is large (100 is okay, though) and the amount of work in for-j is small. "Small" is an ambiguous term and depends on many variables, but an inner loop that takes less than about 1 millisecond would very likely be wasteful to parallelize with OpenMP.
However, in the case where for-i is not parallelizable and only for-j is, Example 2 is the only option. In this case, you must consider carefully whether the amount of parallel work can offset the parallel overhead.
Example 3 is perfectly valid as long as for-i and for-j are both safely parallelizable (i.e., there are no loop-carried flow dependences in either loop). Example 3 is called nested parallelism. You may take a look at this article. Nested parallelism should be used with care. In many OpenMP implementations, you need to manually turn on nested parallelism by calling omp_set_nested (deprecated since OpenMP 5.0 in favor of omp_set_max_active_levels). However, as nested parallelism may spawn a huge number of threads, its benefit may be significantly reduced.
It depends on the amount of work you're doing in the inner loop. If it's small, launching too many threads will add overhead. If the work is big, I would probably go with option 2, depending on the number of cores your machine has.
BTW, the only variable you need to flag as private is "j" in Example 1. In all the other cases it's implicit.
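As a small illustration of that last point, here is a sketch of Example 1 in which the loop variables are declared inside the loops, so no private clause is needed at all. The function name fill_table and the loop body (storing i*j) are just placeholders for the "do sth" part.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Example 1 pattern: only the outer loop is parallelized.
std::vector<long long> fill_table(int rows, int cols) {
    std::vector<long long> table(static_cast<std::size_t>(rows) * cols);
    #pragma omp parallel for
    for (int i = 0; i < rows; ++i) {      // i: private as the parallel loop variable
        for (int j = 0; j < cols; ++j) {  // j: declared inside the region, hence private
            // Each (i, j) pair writes its own element, so there are no races.
            table[static_cast<std::size_t>(i) * cols + j] = static_cast<long long>(i) * j;
        }
    }
    return table;
}
```

If j were declared before the pragma, as in the question, it would be shared by default and would need private(j).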