Nested data environments with different subparts of the same array - OpenACC

Here is my question about OpenACC.
I have read the API specifications (v1 and v2), and the behavior of nested data environments with different subparts of the same array is unclear to me.
Code example:
#pragma acc data pcopyin(a[0:20])
{
    #pragma acc data pcopyin(a[100:20])
    {
        #pragma acc parallel loop
        for(i = 0; i < 20; i++) {
            a[i] = i;
            a[i+100] = i;
        }
    }
}
My understanding is that this should work (or at least the two acc data regions should):
The first pragma checks whether a[0:20] is present on the accelerator.
It is not, so the data are allocated on the device and transferred.
The second pragma checks whether a[100:20] is present on the accelerator.
The pointer a is on the accelerator, but the data from a[100] through a[119] are not.
So those data are allocated on the device and transferred.
I tried this with the CAPS compiler (v3.3.0, which is the only version available on my test machine right now), and the second acc data pragma reports an error (my second subarray doesn't have the correct shape).
So what I suppose happens in my test is that the pointer "a" was found on the accelerator, but the shape associated with it ([0:20]) does not match the one in my second pragma ([100:20]).
Is this the behavior intended by the API, or should my example work?
Moreover, if it is supposed to work, is there some sort of coherence between the subparts of the same array (i.e., will they be positioned as on the host, so that I can write a[i] += a[100+i] in my kernel)?

The present test only checks whether "a" is on the device. Hence, when the second data region is encountered, "a" is already on the device, but only partially. A better method is to add a pointer that points into "a" and reference that pointer on the device. Something like:
#include <stdio.h>
int main () {
    int a[200];
    int *b;
    int i;

    for(i = 0; i < 200; i++) a[i] = 0;
    b = a + 100;

    #pragma acc data pcopy(a[0:20])
    {
        #pragma acc data pcopy(b[0:20])
        {
            #pragma acc parallel loop
            for(i = 0; i < 20; i++) {
                a[i] = i;
                b[i] = i;
            }
        }
    }
    for(i = 0; i < 22; i++) printf("%d = %d \n", i, a[i]);
    for(i = 100; i < 122; i++) printf("%d = %d \n", i, a[i]);
    return 0;
}
If you had just copied "a[100:20]", then accessing outside this range would be considered a programmer error.
Hope this helps,
Mat
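As a follow-up on the coherence part of the question: with the pointer approach above, both subarrays are present on the device, so a single kernel can combine them through the two pointers. A minimal sketch along the lines of the example above (whether a[100+i] through the original pointer would also resolve correctly is implementation-dependent, so going through b is the safe route):

#pragma acc data pcopy(a[0:20]) pcopy(b[0:20])
{
    #pragma acc parallel loop
    for(i = 0; i < 20; i++)
        a[i] += b[i];   /* since b = a + 100, this is effectively a[i] += a[100+i] */
}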

Related

OpenMP double for loop

I'd like to use OpenMP to apply multi-threading.
Here is the simple code that I wrote.
vector<Vector3f> a;
int i, j;
for (i = 0; i < 10; i++)
{
    Vector3f b;
    #pragma omp parallel for private(j)
    for (j = 0; j < 3; j++)
    {
        b[j] = j;
    }
    a.push_back(b);
}
for (i = 0; i < 10; i++)
{
    cout << a[i] << endl;
}
I want to change it to work like:
parallel for1
{
for2
}
or
for1
{
parallel for2
}
The code works when the #pragma line is deleted, but it does not work when I use it. What's the problem?
///////// Added
Actually, I am using OpenMP in a more complicated example, hence this double for loop question.
Here, too, it works well when I do not apply OpenMP.
But when I apply it, the error occurs at the vector push_back line.
vector<Class> B;
for 1
{
    #pragma omp parallel for private(j)
    parallel for j
    {
        Class A;
        B.push_back(A); // error!!!!!!!
    }
}
If I remove the B.push_back(A) line, it also works when I apply OpenMP.
I could not find the exact error message, but it looks like an exception related to the vector, I guess. The debugger stops at:
void _Reallocate(size_type _Count)
{ // move to array of exactly _Count elements
pointer _Ptr = this->_Getal().allocate(_Count);
_TRY_BEGIN
_Umove(this->_Myfirst, this->_Mylast, _Ptr);
std::vector::push_back is not thread safe; you cannot call it from multiple threads without any protection against race conditions.
Instead, prepare the vector so that its size is already correct and then insert the elements via operator[], as sketched below.
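A minimal sketch of that approach, assuming the total number of elements (here count) is known before the loop and using the Class type from the question:

#include <vector>

std::vector<Class> B(count);   // size fixed before the parallel region
#pragma omp parallel for
for (int j = 0; j < count; j++)
{
    Class A;
    B[j] = A;                  // each iteration writes its own slot: no race
}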
Alternatively you can protect the insertion with a critical region:
#pragma omp critical
B.push_back(A);
This way only one thread at a time will do the insertion, which fixes the error but slows down the code.
In general I think you are not approaching the parallelization the right way, but there is no way to give better advice without a clearer and more representative problem description.

OpenMP: How to copy the value of a firstprivate variable back to global

I am new to OpenMP and I am stuck on a basic operation. Here is sample code for my question.
#include <omp.h>
int main(void)
{
    int A[16] = {1,2,3,4,5 ...... 16};
    #pragma omp parallel for firstprivate(A)
    for(int i = 0; i < 4; i++)
    {
        for(int j = 0; j < 4; j++)
        {
            A[i*4+j] = Process(A[i*4+j]);
        }
    }
}
As is evident, the value of A is local to each thread. However, at the end, I want to write the part of A calculated by each thread back to the corresponding position in the global variable A. How can this be accomplished?
Simply make A shared. This is fine because all loop iterations operate on separate elements of A. Remember that OpenMP is shared-memory programming.
You can do so explicitly by using shared instead of firstprivate, or simply remove the clause:
int A[16] = {1,2,3,4,5 ...... 16};
#pragma omp parallel for
for(int i = 0; i < 4; i++)
By default, all variables declared outside of the parallel region are shared. You can find an extended description with examples in this answer.
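Put together, a complete sketch of the corrected program (Process is assumed to be declared elsewhere, as in the question, and the elided initializer is written out as 1 through 16):

#include <omp.h>

int Process(int x);   // assumed from the question

int main(void)
{
    int A[16] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};
    #pragma omp parallel for   // A is shared by default
    for(int i = 0; i < 4; i++)
    {
        for(int j = 0; j < 4; j++)
        {
            A[i*4+j] = Process(A[i*4+j]);
        }
    }
    return 0;
}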

Shared arrays in OpenMP

I'm trying to parallelize a piece of C++ code with OpenMP, but I'm facing some problems.
In fact, my parallelized code is not faster than the serial one.
I think I have understood the cause, but I'm not able to solve it.
The structure of my code is like this:
int vec1 [M];
int vec2 [N];
...initialization of vec1 and vec2...
for (int it = 0; it < tot_iterations; it++) {
    if ( (it+1)%2 != 0 ) {
        #pragma omp parallel for
        for (int j = 0; j < N; j++) {
            ....code involving a call to a function to which I'm passing as a parameter vec1.....
            if (something) { vec2[j] = vec2[j]-1; }
        }
    }
    else {
        #pragma omp parallel for
        for (int i = 0; i < M; i++) {
            ....code involving a call to a function to which I'm passing as a parameter vec2.....
            if (something) { vec1[i] = vec1[i]-1; }
        }
    }
}
I thought that maybe my parallelized code is slower because multiple threads want to access the same shared array and each has to wait until another has finished, but I'm not sure how things really work. I can't make vec1 and vec2 private, since the updates wouldn't be seen in the other iterations...
How can I improve this?
When you speak about issues when accessing the same array from multiple threads, this is called "false sharing". Unless your array is small, it should not be the bottleneck here, since #pragma omp parallel for uses static scheduling in its default implementation (with gcc at least), so each thread accesses most of the array without contention, unless your "...code involving a call to a function to which I'm passing as a parameter vec2....." really accesses a lot of elements in the array.
Case 1: You do not access most elements of the array in this part of the code
Is M big enough for parallelism to be useful?
Can you move the parallelism to the outer loop? (with one loop for vec1 only and the other for vec2 only)
Try moving the parallel region out of the iteration loop:
int vec1 [M];
int vec2 [N];
...initialization of vec1 and vec2...
#pragma omp parallel
for (int it = 0; it < tot_iterations; it++) {
    if ( (it+1)%2 != 0 ) {
        #pragma omp for
        for (int j = 0; j < N; j++) {
            ....code involving a call to a function to which I'm passing as a parameter vec1.....
            if (something) { vec2[j] = vec2[j]-1; }
        }
    }
    else {
        #pragma omp for
        for (int i = 0; i < M; i++) {
            ....code involving a call to a function to which I'm passing as a parameter vec2.....
            if (something) { vec1[i] = vec1[i]-1; }
        }
    }
}
This should not change much, but some implementations have a costly parallel region creation.
Case 2: You access every element of the array with every thread
I would say you can't do that if you perform updates, because you would have concurrency issues: there is an order dependency in the loop.

Rewriting a simple C++ code snippet into CUDA code

I have written the following simple C++ code.
#include <iostream>
#include <omp.h>
using namespace std;

int main()
{
    int myNumber = 0;
    int numOfHits = 0;

    cout << "Enter my Number Value" << endl;
    cin >> myNumber;

    #pragma omp parallel for reduction(+:numOfHits)
    for(int i = 0; i <= 100000; ++i)
    {
        for(int j = 0; j <= 100000; ++j)
        {
            for(int k = 0; k <= 100000; ++k)
            {
                if(i + j + k == myNumber)
                    numOfHits++;
            }
        }
    }

    cout << "Number of Hits " << numOfHits << endl;
    return 0;
}
As you can see, I use OpenMP to parallelize the outermost loop. What I would like to do is rewrite this small code in CUDA. Any help will be much appreciated.
Well, I can give you a quick tutorial, but I won't necessarily write it all for you.
So first of all, you will want to get MS Visual Studio set up with CUDA, which is easy following this guide: http://www.ademiller.com/blogs/tech/2011/05/visual-studio-2010-and-cuda-easier-with-rc2/
Now you will want to read the NVIDIA CUDA Programming Guide (a free PDF), the documentation, and CUDA by Example (a book I highly recommend for learning CUDA).
But let's say you haven't done that yet; you definitely should later.
This is an extremely arithmetic-heavy and data-light computation. In fact, it can be computed fairly simply without this brute-force method, but that isn't the answer you are looking for. I suggest something like this for the kernel:
__global__ void kernel(int* myNumber, int* numOfHits){
    //a shared value is stored on-chip, which is beneficial since it is written to multiple times
    //it is shared by all threads in the block, so one thread initializes it
    __shared__ int s_hits;
    if(threadIdx.x == 0 && threadIdx.y == 0)
        s_hits = 0;
    __syncthreads();
    //these identify the current thread uniquely within the grid
    int i0 = (threadIdx.x + blockIdx.x*blockDim.x);
    int j0 = (threadIdx.y + blockIdx.y*blockDim.y);
    //we stride i and j by the total number of threads in each grid dimension:
    //the block width, 16 here, times the number of blocks, which can be quite large (but not 100,000)
    for(int i = i0; i <= 100000; i += blockDim.x*gridDim.x){
        for(int j = j0; j <= 100000; j += blockDim.y*gridDim.y){
            //Thanks to talonmies for this simplification: instead of looping over k,
            //check whether k = *myNumber - i - j lies in the valid range
            if(0 <= (*myNumber-i-j) && (*myNumber-i-j) <= 100000){
                //an atomic increment, so the value cannot change under us
                //during the 'read, modify, write' process
                atomicAdd(&s_hits, 1);
            }
        }
    }
    //synchronize threads, so we know s_hits is completely updated
    __syncthreads();
    //only one thread per threadblock adds s_hits into the global count, again atomically
    if(threadIdx.x == 0 && threadIdx.y == 0)
        atomicAdd(numOfHits, s_hits);
}
To launch the kernel, you will want something like this:
dim3 blocks(some_number, some_number, 1); //some_number should be hand-optimized
dim3 threads(16, 16, 1);
kernel<<<blocks, threads>>>(/*args*/);
I know you probably want a quick way to do this, but getting into CUDA isn't really a 'quick' thing; you will need to do some reading and some setup to get it working. Past that, the learning curve isn't too high. I haven't told you anything about memory allocation yet, so you will need to do that yourself (although that is simple). If you follow my code, my goal is that you have to read up a bit on shared memory and CUDA, so you are already kick-started. Good luck!
Disclaimer: I haven't tested my code, and I am not an expert - it could be idiotic.
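As a rough sketch of the host-side setup (untested; error checking omitted, and the grid dimensions are placeholders to hand-tune), the body of main() might look like:

int myNumber = 0;
int numOfHits = 0;
int *d_myNumber, *d_numOfHits;

std::cin >> myNumber;

// allocate device copies of the two integers and send the inputs over
cudaMalloc(&d_myNumber, sizeof(int));
cudaMalloc(&d_numOfHits, sizeof(int));
cudaMemcpy(d_myNumber, &myNumber, sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_numOfHits, &numOfHits, sizeof(int), cudaMemcpyHostToDevice);

dim3 blocks(64, 64, 1);    // placeholder; hand-optimize for your GPU
dim3 threads(16, 16, 1);
kernel<<<blocks, threads>>>(d_myNumber, d_numOfHits);

// this copy also waits for the kernel to finish before reading the result
cudaMemcpy(&numOfHits, d_numOfHits, sizeof(int), cudaMemcpyDeviceToHost);
std::cout << "Number of Hits " << numOfHits << std::endl;

cudaFree(d_myNumber);
cudaFree(d_numOfHits);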

How to parallelize an array shift with OpenMP?

How can I parallelize an array shift with OpenMP?
I've tried a few things, but I didn't get any accurate results for the following example (which rotates the elements of an array of Carteira objects, for a permutation algorithm):
void rotaciona(int i)
{
    Carteira aux = this->carteira[i];
    for(int c = i; c < this->size - 1; c++)
    {
        this->carteira[c] = this->carteira[c+1];
    }
    this->carteira[this->size-1] = aux;
}
Thank you very much!
This is an example of a loop with loop-carried dependencies, and so it can't easily be parallelized as written, because the tasks (each iteration of the loop) aren't independent. Breaking the dependency can vary from a trivial modification to the completely impossible (e.g., an iteration loop where each step depends on the previous one).
Here, the case is somewhere in between. The issue with doing this in parallel is that you need to find out what your rightmost value is going to be before your neighbour changes it. The OpenMP for construct doesn't expose which loop iterations will be "yours", so I don't think you can use the OpenMP for worksharing construct to break up the loop. You can do it yourself, however; it requires a lot more code, and it no longer reduces nicely to the serial case.
Still, an example of how to do this is shown below. You have to break the loop up yourself and then get your rightmost value. An OpenMP barrier ensures that no thread starts modifying values until all the threads have cached their new rightmost value.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char **argv) {
    int i;
    char *array;
    const int n=27;

    array = malloc(n * sizeof(char));
    for (i=0; i<n-1; i++)
        array[i] = 'A'+i;
    array[n-1] = '\0';

    printf("Array pre-shift = <%s>\n",array);

    #pragma omp parallel default(none) shared(array) private(i)
    {
        int nthreads = omp_get_num_threads();
        int tid = omp_get_thread_num();

        int blocksize = (n-2)/nthreads;
        int start = tid*blocksize;
        int end = start + blocksize - 1;
        if (tid == nthreads-1) end = n-2;

        /* we are responsible for values start...end */
        char rightval = array[end+1];
        #pragma omp barrier

        for (i=start; i<end; i++)
            array[i] = array[i+1];
        array[end] = rightval;
    }
    printf("Array post-shift = <%s>\n",array);
    return 0;
}
Though your sample doesn't show any explicit OpenMP pragmas, I don't think it could work easily: you are doing an in-place operation with overlapping regions.
If you split the loop into chunks, you'll have race conditions at the boundaries (because el[n] gets copied from el[n+1], which might already have been updated by another thread).
I suggest that you do manual chunking (which can be done), but I suspect that OpenMP's parallel for is not flexible enough (I haven't tried), so you could just have a parallel region that does the work in chunks and fixes up the boundary elements after a thread barrier/end of the parallel block.
Other thoughts:
if your values are POD, you can use memmove instead (see the sketch after the code below)
if you can, simply switch to a list:
std::list<Carteira> items(3000);
// rotation is now simply:
items.push_back(items.front());
items.erase(items.begin());
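For the memmove suggestion, a minimal sketch of the rotaciona member function (only valid if Carteira is trivially copyable; memmove handles the overlapping source and destination correctly):

#include <cstring>

void rotaciona(int i)
{
    Carteira aux = this->carteira[i];
    // shift elements i+1 .. size-1 left by one position in a single bulk move
    std::memmove(&this->carteira[i], &this->carteira[i+1],
                 (this->size - 1 - i) * sizeof(Carteira));
    this->carteira[this->size - 1] = aux;
}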
