I'm writing a tool to test some graph algorithms. The tool has to go through all the edges in the graph and mark nodes at either end under certain conditions. It then has to go through all the nodes, ensuring they were all marked. The code should be pretty straight-forward, but I'm having some concurrency issues.
Here is my code:
#pragma omp parallel for reduction(+:seen_edges) default(shared)
for (size_t i = 0; i < n_edges; i++)
{
int64_t v0 = node_left(edges[i]), v1 = node_right(edges[i]);
// Do some work...
// This is where I mark nodes if the other end of the edge corresponds to the parent array
if (v0 != v1)
{
if(parents[v0] == v1)
reached[v0] = true;
if(parents[v1] == v0)
reached[v1] = true;
}
// Do more work...
}
#pragma omp parallel for default(shared)
for (size_t i = 0; i < n_nodes; i++)
{
if (i != source && !reached[i])
error("No traversed edge leading to node", i, &n_errors);
}
The reached array is initialised to false everywhere.
I can guarantee that on the input I'm using all the nodes should be marked, and thus no error should be printed. However, sometimes, some nodes remain unmarked.
I think memory shared should be consistent between OpenMP parallel regions, and I never set any element in reached to false except for at initialisation. The implicit barrier at the end of the first region should prevent any thread from going into the second one until all edges have been checked (and all nodes marked on this test input).
I see two possible options, but have no further explanation:
Some kind of data race is going on. But because I never set elements back to false, even if multiple threads try to write to a location at the same time, should that element eventually become true?
The elements are set to true, but memory is not consistent between threads in the second parallel region. Can this even happen in OpenMP?
If someone has any insight, I'd be grateful. Cheers.
Edit: The // Do work parts don't use the reached array, and parents is never modified in the program.
Related
Consider the following chunk of code:
int array[30000]
#pragma omp parallel
{
for( int a = 0; a < 1000; a++ )
{
#pragma omp for nowait
for( int i = 0; i < 30000; i++ )
{
/*calculations with array[i] and also other array entries happen here*/
}
}
}
Race conditions are not a concern in my application but I would like to enforce that each thread in the parallel regions takes care of exactly the same chunk of array at each run through the inner for loop.
It is my understanding that schedule(static) distributes the for-loop items based on the number of threads and the array length. However, it is not clear whether the distribution changes for different loops or different repetitions of the same loop (even when number of threads and length are the same).
What does the standard say about this? Is schedule(static) sufficient to enforce this?
I believe this quote from OpenMP Specification provides such a guarantee:
A compliant implementation of the static schedule must ensure that the same assignment of logical iteration numbers to threads will be used in two worksharing-loop regions if the following conditions are satisfied: 1) both worksharing-loop regions have the same number of loop iterations, 2) both worksharing-loop regions have the same value of chunk_size specified, or both worksharing-loop regions have no chunk_size specified, 3) both worksharing-loop regions bind to the same parallel region, and 4) neither loop is associated with a SIMD construct.
I am trying to use openmp tasks to schedule a tiled execution of basic jacobi2d computation. In jacobi2d there is a dependence on A(i,j) from
A(i, j)
A(i-1, j)
A(i+1, j)
A(i, j-1)
A(i, j+1).
To my understanding of the depend clause I am declaring the dependences correctly, but they are not being respected while executing the code. I have copied the simplified code piece below. Initially my guess was that the out of bounds ranges for some tiles might be causing this issue, so I corrected that but the issue persists.(I have not copied the longer code with corrected tile ranges as that part is just a bunch of ifs + max)
int n=8,tsteps=2,b=4; //n - size of matrix, tsteps - time iterations, b - tile size or block size
#pragma omp parallel
{
#pragma omp master
for (t=0; t<tsteps; ++t)
{
for (i=0; i<n; i+=b)
for (j=0; j<n; j+=b)
{
#pragma omp task firstprivate(t,i,j) depend(in:A[i-1:b+2][j-1:b+2]) depend(out:B[i:b][j:b])
{
#pragma omp critical
printf("t-%d i-%d j-%d --A",t,i,j); //Prints out time loop, i,j
}
}
for (i=0; i<n; i+=b)
for (j=0; j<n; j+=b)
{
#pragma omp task firstprivate(t,i,j) depend(in:B[i-1:b+2][j-1:b+2]) depend(out:A[i:b][j:b])
{
#pragma omp critical
printf("t-%d i-%d j-%d --B",t,i,j); //Prints out time loop, i,j
}
}
}
}
}
So the idea with declaring dependence starting from i-1 and j-1 and the range being (b+2) is that the neighbouring tiles also affect your current tiles calculations. And similarly for the second set of loop where values in A should only be overwritten once the neighbouring tiles have used the values.
Code is being compiled using gcc 5.3 which supports openmp 4.0.
ps: the way array range is declared above denotes the starting position and the number of indices to be considered while creating the dependence graph.
edit (based on Zulan's comment) - changed the inner code to a simple print statement as this will suffice to check order of task execution. Ideally for the above values(since there are only 4 tiles) all tiles should complete the first printf and then only execute the second. But if you execute the code it will mix the order.
So I finally figured out the issue, even though OpenMP specs say that depend clause is supposed to be implemented with a starting point and range, it has not been implemented yet in gcc. So currently it only compares the starting point from the depend clause (depend(in:A[i-1:b+2][j-1:b+2])) A[i-1][j-1] in this case.
Initially I was comparing elements in different relative tile positions. Eg comparing (0,0) element with the last element of the tile, which was giving a no conflicts with dependence and hence the random order of execution of various tasks.
Current gcc implementation does not care about the range provided in the clause at all.
I'm a beginner at cuda and am having some difficulties with it
If I have an input vector A and a result vector B both with size N, and B[i] depends on all elements of A except A[i], how can I code this without having to call a kernel multiple times inside a serial for loop? I can't think of a way to paralelise both the outer and inner loop simultaneously.
edit: Have a device with cc 2.0
example:
// a = some stuff
int i;
int j;
double result = 0;
for(i=0; i<1000; i++) {
double ai = a[i];
for(j=0; j<1000; j++) {
double aj = a[j];
if (i == j)
continue;
result += ai - aj;
}
}
I have this at the moment:
//in host
int i;
for(i=0; i<1000; i++) {
kernelFunc <<<2, 500>>> (i, d_a)
}
Is there a way to eliminate the serial loop?
Something like this should work, I think:
__global__ void my_diffs(const double *a, double *b, const length){
unsigned idx = threadIdx.x + blockDim.x*blockIdx.x;
if (idx < length){
double my_a = a[idx];
double result = 0.0;
for (int j=0; j<length; j++)
result += my_a - a[j];
b[idx] = result;
}
}
(written in browser, not tested)
This can possibly be further optimized in a couple ways, however for cc 2.0 and newer devices that have L1 cache, the benefits of these optimizations might be small:
use shared memory - we can reduce the number of global loads to one per element per block. However, the initial loads will be cached in L1, and your data set is quite small (1000 double elements ?) so the benefits might be limited
create an offset indexing scheme, so each thread is using a different element from the cacheline to create coalesced access (i.e. modify j index for each thread). Again, for cc 2.0 and newer devices, this may not help much, due to L1 cache as well as the ability to broadcast warp global reads.
If you must use a cc 1.x device, then you'll get significant mileage out of one or more optimizations -- the code I've shown here will run noticeably slower in that case.
Note that I've chosen not to bother with the special case where we are subtracting a[i] from itself, as that should be approximately zero anyway, and should not disturb your results. If you're concerned about that, you can special-case it out, easily enough.
You'll also get more performance if you increase the blocks and reduce the threads per block, perhaps something like this:
my_diffs<<<8,128>>>(d_a, d_b, len);
The reason for this is that many GPUs have more than 1 or 2 SMs. To maximize perf on these GPUs with such a small data set, we want to try and get at least one block launched on each SM. Having more blocks in the grid makes this more likely.
If you want to fully parallelize the computation, the approach would be to create a 2D matrix (let's call it c[...]) in GPU memory, of square dimensions equal to the length of your vector. I would then create a 2D grid of threads, and have each thread perform the subtraction (a[row] - a[col]) and store it's result in c[row*len+col]. I would then launch a second (1D) kernel to sum the columns of c (each thread has a loop to sum a column) to create the result vector b. However I'm not sure this would be any faster than the approach I've outlined. Such a "more fully parallelized" approach also wouldn't lend itself as easily to the optimizations I discussed.
I have a clarification question.
It is my understanding, that sourceCpp automatically passes on the RNG state, so that set.seed(123) gives me reproducible random numbers when calling Rcpp code. When compiling a package, I have to add a set RNG statement.
Now how does this all work with openMP either in sourceCpp or within a package?
Consider the following Rcpp code
#include <Rcpp.h>
#include <omp.h>
// [[Rcpp::depends("RcppArmadillo")]]
// [[Rcpp::export]]
Rcpp::NumericVector rnormrcpp1(int n, double mu, double sigma ){
Rcpp::NumericVector out(n);
for (int i=0; i < n; i++) {
out(i) =R::rnorm(mu,sigma);
}
return(out);
}
// [[Rcpp::export]]
Rcpp::NumericVector rnormrcpp2(int n, double mu, double sigma, int cores=1 ){
omp_set_num_threads(cores);
Rcpp::NumericVector out(n);
#pragma omp parallel for schedule(dynamic)
for (int i=0; i < n; i++) {
out(i) =R::rnorm(mu,sigma);
}
return(out);
}
And then run
set.seed(123)
a1=rnormrcpp1(100,2,3,2)
set.seed(123)
a2=rnormrcpp1(100,2,3,2)
set.seed(123)
a3=rnormrcpp2(100,2,3,2)
set.seed(123)
a4=rnormrcpp2(100,2,3,2)
all.equal(a1,a2)
all.equal(a3,a4)
While a1 and a2 are identical, a3 and a4 are not. How can I adjust the RNG state with the openMP loop? Can I?
To expand on what Dirk Eddelbuettel has already said, it is next to impossible to both generate the same PRN sequence in parallel and have the desired speed-up. The root of this is that generation of PRN sequences is essentially a sequential process where each state depends on the previous one and this creates a backward dependence chain that reaches back as far as the initial seeding state.
There are two basic solutions to this problem. One of them requires a lot of memory and the other one requires a lot of CPU time and both are actually more like workarounds than true solutions:
pregenerated PRN sequence: One thread generates sequentially a huge array of PRNs and then all threads access this array in a manner that would be consistent with the sequential case. This method requires lots of memory in order to store the sequence. Another option would be to have the sequence stored into a disk file that is later memory-mapped. The latter method has the advantage that it saves some compute time, but generally I/O operations are slow, so it only makes sense on machines with limited processing power or with small amounts of RAM.
prewound PRNGs: This one works well in cases when work is being statically distributed among the threads, e.g. with schedule(static). Each thread has its own PRNG and all PRNGs are seeded with the same initial seed. Then each thread draws as many dummy PRNs as its starting iteration, essentially prewinding its PRNG to the correct position. For example:
thread 0: draws 0 dummy PRNs, then draws 100 PRNs and fills out(0:99)
thread 1: draws 100 dummy PRNs, then draws 100 PRNs and fills out(100:199)
thread 2: draws 200 dummy PRNs, then draws 100 PRNs and fills out(200:299)
and so on. This method works well when each thread does a lot of computations besides drawing the PRNs since the time to prewind the PRNG could be substantial in some cases (e.g. with many iterations).
A third option exists for the case when there is a lot of data processing besides drawing a PRN. This one uses OpenMP ordered loops (note that the iteration chunk size is set to 1):
#pragma omp parallel for ordered schedule(static,1)
for (int i=0; i < n; i++) {
#pragma omp ordered
{
rnum = R::rnorm(mu,sigma);
}
out(i) = lots of processing on rnum
}
Although loop ordering essentially serialises the computation, it still allows for lots of processing on rnum to execute in parallel and hence parallel speed-up would be observed. See this answer for a better explanation as to why so.
Yes, sourceCpp() etc and an instantiation of RNGScope so the RNGs are left in a proper state.
And yes one can do OpenMP. But inside of OpenMP segment you cannot control in which order the threads are executed -- so you longer the same sequence. I have the same problem with a package under development where I would like to have reproducible draws yet use OpenMP. But it seems you can't.
I've been searching information on Peterson's algorithm but have come across references stating it does not satisfy starvation but only deadlock. Is this true? and if so can someone elaborate on why it does not?
Peterson's algorithm:
flag[0] = 0;
flag[1] = 0;
turn;
P0: flag[0] = 1;
turn = 1;
while (flag[1] == 1 && turn == 1)
{
// busy wait
}
// critical section
...
// end of critical section
flag[0] = 0;
P1: flag[1] = 1;
turn = 0;
while (flag[0] == 1 && turn == 0)
{
// busy wait
}
// critical section
...
// end of critical section
flag[1] = 0;
The algorithm uses two variables, flag and turn. A flag value of 1 indicates that the process wants to enter the critical section. The variable turn holds the ID of the process whose turn it is. Entrance to the critical section is granted for process P0 if P1 does not want to enter its critical section or if P1 has given priority to P0 by setting turn to 0.
As Ben Jackson suspects, the problem is with a generalized algorithm. The standard 2-process Peterson's algorithm satisfies the no-starvation property.
Apparently, Peterson's original paper actually had an algorithm for N processors. Here is a sketch that I just wrote up, in a C++-like language, that is supposedly this algorithm:
// Shared resources
int pos[N], step[N];
// Individual process code
void process(int i) {
int j;
for( j = 0; j < N-1; j++ ) {
pos[i] = j;
step[j] = i;
while( step[j] == i and some_pos_is_big(i, j) )
; // busy wait
}
// insert critical section here!
pos[i] = 0;
}
bool some_pos_is_big(int i, int j) {
int k;
for( k = 0; k < N-1; k++ )
if( k != i and pos[k] >= j )
return true;
}
return false;
}
Here's a deadlock scenario with N = 3:
Process 0 starts first, sets pos[0] = 0 and step[0] = 0 and then waits.
Process 2 starts next, sets pos[2] = 0 and step[0] = 2 and then waits.
Process 1 starts last, sets pos[1] = 0 and step[0] = 1 and then waits.
Process 2 is the first to notice the change in step[0] and so sets j = 1, pos[2] = 1, and step[1] = 2.
Processes 0 and 1 are blocked because pos[2] is big.
Process 2 is not blocked, so it sets j = 2. It this escapes the for loop and enters the critical section. After completion, it sets pos[2] = 0 but immediately starts competing for the critical section again, thus setting step[0] = 2 and waiting.
Process 1 is the first to notice the change in step[0] and proceeds as process 2 before.
...
Process 1 and 2 take turns out-competing process 0.
References. All details obtained from the paper "Some myths about famous mutual exclusion algorithms" by Alagarsamy. Apparently Block and Woo proposed a modified algorithm in "A more efficient generalization of Peterson's mutual exclusion algorithm" that does satisfy no-starvation, which Alagarsamy later improved in "A mutual exclusion algorithm with optimally bounded bypasses" (by obtaining the optimal starvation bound N-1).
A Rex is wrong with the deadlock situation.
(as a side note: the correct term would be starvation scenario, since for a deadlock there are at least two threads required to be 'stuck' see wikipedia: deadlock and starvation)
As process 2 and 1 go into level 0, step[0] is set to either 1 or 2 and thus making the advance condition of process 0 false since step[0] == 0 is false.
The peterson algorithm for 2 processes is a little simpler and does protect against starvation.
The peterson algorithm for n processes is much more complicated
To have a situation where a process starves the condition step[j] == i and some_pos_is_big(i, j) must be true forever. This implies that no other process enters the same level (which would make step[j] == i false) and that at least one process is always on the same level or on a higher level as i (to guarantee that some_pos_is_big(i, j) is kept true)
Moreover, only one process can be deadlocked in this level j. If two were deadlocked then for one of them step[j] == i would be false and therefor wouldn't be deadlocked.
So that means no process can't enter the same level and there must always be a a process in a level above.
As no other process could join the processes above (since they can't get into level j and therefor not above lelel j) at least one process must be deadlocked too above or the process in the critical section doesn't release the critical section.
If we assume that the process in the critical section terminates after a finite time, then only one of the above processes must be deadlocked.
But for that one to be deadlocked, another one above must be deadlocked etc.
However, there are only finite processes above, so eventually the top process can't be deadlocked, as it'll advance once the critical section is given free.
And therefor the peterson algorithm for n processes protects against starvation!
I suspect the comment about starvation is about some generalized, N-process Peterson's Algorithm. It is possible to construct an N-process version with bounded waiting, but without having one in particular to discuss we can't say why that particular generalization might be subject to starvation.
A quick Google turned up this paper which includes pseudocode. As you can see, the generalized version is much more complex (and expensive).