OpenMP data races without a collapse clause? - parallel-processing

I'm learning OpenMP and faced a weird (to me) issue with the collapse clause. My actual code is a lot longer than this, but I was able to reproduce my issue using this short version:
#include <stdio.h>
#include <stdlib.h>
int main()
{
size_t nrows = 10, ncols = 10;
unsigned int *cells;
int row, col;
cells = calloc(nrows * ncols, sizeof *cells);
#pragma omp parallel for
for (row = 0; row < nrows; row++) {
for (col = 0; col < ncols; col++)
if (row * col % 10)
cells[row * nrows + col] = 1;
}
for (row = 0; row < nrows; row++) {
for (col = 0; col < ncols; col++)
printf("%d ", cells[row * nrows + col]);
printf("\n");
}
}
I get the expected output using one thread OMP_NUM_THREADS=1 ./test:
0 0 0 0 0 0 0 0 0 0
0 1 1 1 1 1 1 1 1 1
0 1 1 1 1 0 1 1 1 1
0 1 1 1 1 1 1 1 1 1
0 1 1 1 1 0 1 1 1 1
0 1 0 1 0 1 0 1 0 1
0 1 1 1 1 0 1 1 1 1
0 1 1 1 1 1 1 1 1 1
0 1 1 1 1 0 1 1 1 1
0 1 1 1 1 1 1 1 1 1
and its md5sum is 94ae20845c84c865dbea94918ac5f06e. Now, if I run it using more than one thread many times, it sometimes generates different results.
$ for i in `seq 1 100`; do OMP_NUM_THREADS=2 ./test | md5sum; done | sort -u
94ae20845c84c865dbea94918ac5f06e *-
b1123bcfe82797548237998874cd0fd5 *-
What's more interesting? If I add collapse(2) to parallel for, I get the expected output consistently.
I also tried:
#pragma omp parallel for
for (row = 0; row < nrows; row++) {
for (col = 0; col < ncols; col++)
if (row * col % 10)
#pragma omp atomic write
cells[row * nrows + col] = 1;
#pragma omp flush
}
but it didn't help.
My only uneducated theory is that without collapse(2), after each col for loop, a second thread somehow overwrites the result of its previous thread with its initial 0 cell values, which it never touched because each thread updates its own portion of the cells array (row * nrows + col is unique). Is it false sharing because multiple threads try to access cells close to each other? Then, why does it only happen without collapse(2) and is it still safe with collapse(2)?
FYI, I used MinGW GCC from MSYS2.
The reason why I want to parallelize the outer loop only is the cost of collapsing with dynamic scheduling. Any help would be appreciated!

TL;DR answer: You have a data race on variable col, which does not occur if you use the collapse(2) clause because it privatizes both loop variables.
Details: You have to examine the sharing attributes of your loop variables (row and col) to understand what is happening here.
If the collapse(2) clause is not used:
#pragma omp parallel for
for (row = 0; row < nrows; row++) {
for (col = 0; col < ncols; col++)
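the sharing attribute of variable row is private, because the iteration variable of the loop associated with the parallel for construct is implicitly privatized. Variable col, however, is declared outside the parallel region and belongs to an inner loop that is not associated with the construct, so it is shared. This means that different threads use the same variable (memory location), which creates a race condition. That is why you sometimes obtain unexpected results.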
On the other hand, if the collapse(2) clause is used, both loop variables are privatized, so there is no race condition, and you always obtain the correct result.
To fix your code (without using the collapse(2) clause) you have to make col private. To do so you have 2 alternatives:
a) The preferred method is to define your variables in their minimum required scope. Note that variables declared in the parallel region are private by default.
#pragma omp parallel for
for (size_t row = 0; row < nrows; row++) {
for (size_t col = 0; col < ncols; col++)
b) The alternative is to explicitly use the private clause:
#pragma omp parallel for private(col)
for (row = 0; row < nrows; row++) {
for (col = 0; col < ncols; col++)
Note that the line cells[row * nrows + col] = 1; is free from data races, because each thread writes to its own distinct elements, so the atomic operation and the flush are not necessary.
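Putting alternative a) together, a minimal sketch of the whole loop could look like the following. The function name mark_cells is hypothetical, and the row stride is written as ncols rather than nrows (an assumption; the two are equal in the question, so the output is unchanged).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: sets cells[row][col] to 1 whenever row * col is not
 * a multiple of 10. Both loop counters are declared inside their for
 * statements, so every thread has its own row and col and no private(...)
 * clause is required. */
void mark_cells(unsigned int *cells, size_t nrows, size_t ncols)
{
    assert(cells != NULL);

    #pragma omp parallel for
    for (size_t row = 0; row < nrows; row++) {
        for (size_t col = 0; col < ncols; col++) {
            if (row * col % 10)
                cells[row * ncols + col] = 1;   /* distinct element per (row, col) */
        }
    }
}
```

If the file is compiled without OpenMP support, the pragma is ignored and the function simply runs serially with the same result.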

Related

How to print (output) multiple lines from a loop?

I would like to know how to output multiple lines from a for loop.
For example, I want to input a number of lines N and then N lines of numbers. I am trying to output the inputs as shown below, but whenever I do this it prints only the last digits and not everything I entered.
This is what I know so far.
#include <iostream>
using namespace std;
int main()
{
int N, M, num;
cin >> N;
for (int i = 0; i < N; i++)
{
for (int j = 0; j < 3; j++)
{
cin >> M;
for (int k = 0; k < M; k++)
cout << M << endl;
}
}
return 0;
}
Input:
2 (This is for N)
1 2 3
4 5 6
or
3
10 20 30
50 100 500
1000 5 0
---------
Output:
1 2 3
4 5 6
10 20 30
50 100 500
1000 5 0

Does a while loop end when i gets infinitely close to zero?

So I was wondering about the time complexity of the following code. The correct solution says it's O(log N), which I understand if the loop terminates. But since we only halve i, in theory i can get really close to 0 without ever reaching it, so does the loop never end?!
int a = 0, i = N;
while (i > 0) {
a += i;
i /= 2;
}
It's important to keep in mind that you're dividing an int, not a floating-point value, so there is no fractional component. Instead, any remainder will simply be discarded. So once you get down to 1, half of that is 0.5, and since you take the integer part only, you will have 0. So therefore, this will eventually finish.
For example, if you started with 10:
10 / 2 is 5
5 / 2 is 2 with a remainder of 1—remainder is discarded—i is 2
2 / 2 is 1
1 / 2 is 0 with a remainder of 1—remainder is discarded—i is 0
Since you're using integer arithmetic, i does go to 0.
Here's an interactive example using picoc
$ picoc -i
starting picoc v2.1
picoc> #include <stdio.h>
#include <stdio.h>
picoc> int a=0, i=10;
int a=0, i=10;
picoc> a += i; i /= 2; printf("a=%d i=%d\n",a,i);
a += i; i /= 2; printf("a=%d i=%d\n",a,i);
a=10 i=5
picoc> a += i; i /= 2; printf("a=%d i=%d\n",a,i);
a += i; i /= 2; printf("a=%d i=%d\n",a,i);
a=15 i=2
picoc> a += i; i /= 2; printf("a=%d i=%d\n",a,i);
a += i; i /= 2; printf("a=%d i=%d\n",a,i);
a=17 i=1
picoc> a += i; i /= 2; printf("a=%d i=%d\n",a,i);
a += i; i /= 2; printf("a=%d i=%d\n",a,i);
a=18 i=0
Your loop would have terminated at this point.
Yes, the loop will actually end. Since i is an int, when you halve i you are performing integer division. The result of this division is truncated toward zero, which for positive values means rounding down to the nearest integer.
For example:
int i=3;
int j= i/2;
// mathematically 3/2 is 1.5, but integer division discards the fraction
// so the result will be j = 1
If we consider a run of your program for N=5, we have:
First iteration:
i=5;
i = i/2 = 5/2 = 2.5 = 2; //Round 2.5 down
Second iteration:
i=2
i = i/2 = 2/2 = 1;
Third iteration:
i=1;
i = i/2 = 1/2 = 0.5 = 0; //Round 0.5 down, loop finishes
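The traces above can be checked with a small helper that runs the question's loop and counts how often the body executes. This is only a sketch: the name halving_steps and the sum_out parameter are hypothetical, and the sum is kept in a long long so it cannot overflow for large n.

```c
#include <assert.h>

/* Runs the loop from the question starting at i = n and returns the number
 * of iterations. If sum_out is non-null, the accumulated value of a is
 * stored through it. */
int halving_steps(int n, long long *sum_out)
{
    assert(n >= 0);

    long long a = 0;
    int steps = 0;
    for (int i = n; i > 0; i /= 2) {   /* integer division: 1 / 2 == 0 */
        a += i;
        steps++;
    }
    if (sum_out)
        *sum_out = a;
    return steps;
}
```

For n = 10 the helper returns 4 with a sum of 18, matching the picoc session. In general the count is floor(log2(n)) + 1 for n >= 1, which is where the O(log N) bound comes from.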

Can separate forked processes jointly access shared, dynamically allocated memory through a pointer?

I am trying to parallelize the addition of two simple 4x4 matrices. The child process adds only the odd rows and the parent adds the even ones. However, I can't seem to get the processes to work on shared pointer memory, and the output is always given in halves, shown beneath the code:
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
int main() {
int A[4][4] = {{1,2,3,4},{6,7,8,9},{11,12,13,14},{16,17,18,19}};
int B[4][4] = {{1,2,3,4},{6,7,8,9},{11,12,13,14},{16,17,18,19}};
int** C = (int**)malloc(sizeof(int*)*4);
for (int z = 0; z<4; z++) {
C[z] = (int*)malloc(sizeof(int)*4);
}
pid_t cp;
printf("Before fork\n");
cp = fork();
if (cp == 0) {
for(int i = 1; i < 4; i+=2) {
for(int j = 0; j < 4; j++) {
printf("child process\n");
printf("We are adding cell %d,%d\n",i,j);
C[i][j] = A[i][j] + B[i][j];
sleep(1);
}
}
} else {
for (int k = 0; k < 4; k+=2) {
for(int l = 0; l < 4; l++) {
printf("parent process\n");
printf("We are adding cell %d,%d\n",k,l);
C[k][l] = A[k][l] + B[k][l];
sleep(1);
}
}
}
sleep(10);
printf("We are printing C here\n");
for (int m = 0; m < 4; m++) {
for(int n = 0; n < 4; n++) {
printf("%d ",C[m][n]);
}
printf("\n");
}
}
This is the output of the final for loop in the above code:
We are printing C here
0 0 0 0
12 14 16 18
0 0 0 0
32 34 36 38
We are printing C here
2 4 6 8
0 0 0 0
22 24 26 28
0 0 0 0
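The two printouts are what one would expect when each process works on its own copy of C: after fork() the child gets a duplicate of the parent's address space, so memory obtained from malloc is not shared between them. One possible way to share the result, shown here only as a sketch, is to allocate C with POSIX mmap using MAP_SHARED | MAP_ANONYMOUS before forking. The function name add_matrices_forked and the out parameter are hypothetical, and the code assumes a POSIX system that provides MAP_ANONYMOUS.

```c
#define _DEFAULT_SOURCE          /* exposes MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Adds two 4x4 matrices: the child computes the odd rows, the parent the
 * even rows. Returns 0 on success and -1 on failure. */
int add_matrices_forked(int A[4][4], int B[4][4], int out[4][4])
{
    assert(out != NULL);

    /* One contiguous 4x4 block that remains shared across fork(). */
    int (*C)[4] = mmap(NULL, sizeof(int[4][4]), PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (C == MAP_FAILED)
        return -1;

    pid_t cp = fork();
    if (cp < 0) {
        munmap(C, sizeof(int[4][4]));
        return -1;
    }
    if (cp == 0) {                       /* child: odd rows */
        for (int i = 1; i < 4; i += 2)
            for (int j = 0; j < 4; j++)
                C[i][j] = A[i][j] + B[i][j];
        _exit(0);                        /* never fall through into parent code */
    }

    for (int k = 0; k < 4; k += 2)       /* parent: even rows */
        for (int l = 0; l < 4; l++)
            C[k][l] = A[k][l] + B[k][l];

    /* Wait for the child instead of sleeping, then copy the result out. */
    int status = 0;
    int ok = waitpid(cp, &status, 0) == cp
             && WIFEXITED(status) && WEXITSTATUS(status) == 0;
    if (ok)
        memcpy(out, C, sizeof(int[4][4]));
    munmap(C, sizeof(int[4][4]));
    return ok ? 0 : -1;
}
```

The two processes write disjoint rows, so no further synchronization is needed beyond waiting for the child before reading the result.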

Crossing out bad lines from the binary matrix

We are given a square binary matrix with side n.
We will consider any row or column that contains at least one 0 as 'bad'.
The task is to nullify all bad rows and columns.
The task must be solved with O(1) additional memory.
1 1 0      0 0 0
1 1 1  =>  1 0 0
1 0 1      0 0 0
The tough part is that we cannot nullify bad lines as we discover them during traversal (otherwise we would always end up with a zeroed matrix). So I am looking for a data structure or a data representation that can store all the information about bad rows and columns while the algorithm iterates through the matrix.
Actually we only need 2n bits to get the answer: we need to know for each row and column if it is good (1) or bad (0). The answer in each cell would be product of the answers for row and column.
Let's store most of that information in the matrix itself:
we can use the first row to keep records (0 or 1) for all columns but the first,
and the first column to keep records for all rows but the first; we need two more bits to keep records for the first row and the first column themselves.
First we compute those two additional bits (by checking the first row and the first column).
Then we find and store the records for the other rows and columns.
Then we calculate the resulting bits in the whole matrix except for the first row and column.
And finally: the first row should be nullified if it was bad and kept as it is otherwise, and the same is to be done with the first column.
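The steps above can be sketched in C as follows. This is an illustration under assumptions, not the answerer's own code: the matrix is taken to be a row-major array of 0/1 bytes, and the name nullify_bad_lines is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Zeroes every row and column of the n x n 0/1 matrix m that contains at
 * least one 0. Only two extra flags are used besides loop counters. */
void nullify_bad_lines(unsigned char *m, size_t n)
{
    assert(m != NULL || n == 0);
    if (n == 0)
        return;

    /* The two extra bits: is the first row / first column good? */
    int row0_good = 1, col0_good = 1;
    for (size_t j = 0; j < n; j++)
        if (!m[j]) row0_good = 0;
    for (size_t i = 0; i < n; i++)
        if (!m[i * n]) col0_good = 0;

    /* Record a 0 for every bad row in column 0 and every bad column in row 0. */
    for (size_t i = 1; i < n; i++)
        for (size_t j = 1; j < n; j++)
            if (!m[i * n + j]) {
                m[i * n] = 0;
                m[j] = 0;
            }

    /* Each inner cell becomes the product of its row and column records. */
    for (size_t i = 1; i < n; i++)
        for (size_t j = 1; j < n; j++)
            m[i * n + j] = (unsigned char)(m[i * n] && m[j]);

    /* Finally handle the first row and first column themselves. */
    if (!row0_good)
        for (size_t j = 0; j < n; j++) m[j] = 0;
    if (!col0_good)
        for (size_t i = 0; i < n; i++) m[i * n] = 0;
}
```

The records in the first row and column already equal the final values of those cells when the first row and column are good, which is why they only need to be touched again when one of the two extra bits says "bad".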
As the first step, look for a 0 in the grid. If we can't find one, we're done.
If we found one, we know that we're supposed to nullify all 1's in the 0's row and column.
So, since we know the final value of all those cells, we could use that row and column as temporary boolean flags for whether the same row or column contains any 0's.
The exact process:
Look through the matrix to find a 0, and keep track of its coordinates.
Loop over each row and column, check whether that row or column contains a 0, and set the corresponding flag in the found 0's column or row.
Then loop over the matrix again and, for each 1 not in the flag row or column, check whether either its row flag or its column flag is set; if it is, set that 1 to 0.
Then set all cells in the row and column acting as flags to 0.
This runs in linear time (O(mn) with m rows and n columns) and O(1) space.
Example:
Input:
1 1 0 1 0
1 1 1 0 1
1 0 1 1 1
1 1 1 1 1
1 1 1 1 1
Then we look for a zero, and let's say we find the top-middle one.
Then we use the top row and middle column as flag for whether the same row / column contains a 0:
0 1 0 1 1
    1
    1
    0
    0
Then we loop over the other cells setting the 1's to 0's if the flag row / column is set:
0 0   0 0
0 0   0 0
1 0   0 0
1 0   0 0
Then we set the flag row and column to 0's:
0 0 0 0 0
    0
    0
    0
    0
Then we have our final output:
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
1 0 0 0 0
1 0 0 0 0
This would obviously be done in-place, I just separated it out visually for clarity.
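As a rough sketch of the process described above, the following C function uses the row and column of the first 0 it finds as flag storage, with a flag value of 1 meaning "contains a 0" as in the example. The row-major byte layout and the name nullify_via_pivot are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Zeroes every row and column of the rows x cols 0/1 matrix m that
 * contains a 0, using the found 0's row and column as flag storage. */
void nullify_via_pivot(unsigned char *m, size_t rows, size_t cols)
{
    assert(m != NULL);

    /* Step 1: find any 0 and remember its coordinates (pr, pc). */
    size_t pr = rows, pc = cols;
    for (size_t i = 0; i < rows && pr == rows; i++)
        for (size_t j = 0; j < cols; j++)
            if (!m[i * cols + j]) { pr = i; pc = j; break; }
    if (pr == rows)
        return;                          /* no 0 anywhere: nothing to do */

    /* Step 2a: row pr stores, per column, whether that column has a 0. */
    for (size_t j = 0; j < cols; j++) {
        if (j == pc) continue;
        unsigned char has_zero = 0;
        for (size_t i = 0; i < rows; i++)
            if (!m[i * cols + j]) has_zero = 1;
        m[pr * cols + j] = has_zero;
    }
    /* Step 2b: column pc stores, per row, whether that row has a 0. */
    for (size_t i = 0; i < rows; i++) {
        if (i == pr) continue;
        unsigned char has_zero = 0;
        for (size_t j = 0; j < cols; j++)
            if (!m[i * cols + j]) has_zero = 1;
        m[i * cols + pc] = has_zero;
    }

    /* Step 3: clear every other cell whose row or column flag is set. */
    for (size_t i = 0; i < rows; i++) {
        if (i == pr) continue;
        for (size_t j = 0; j < cols; j++) {
            if (j == pc) continue;
            if (m[i * cols + pc] || m[pr * cols + j])
                m[i * cols + j] = 0;
        }
    }

    /* Step 4: the flag row and column are bad by construction. */
    for (size_t j = 0; j < cols; j++) m[pr * cols + j] = 0;
    for (size_t i = 0; i < rows; i++) m[i * cols + pc] = 0;
}
```

Each flag is written only after its own row or column has been fully scanned, so the scans always read original values.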
I have written a C++ implementation with the help of Natalya Ginzburg's answer.
I'm leaving it here in case it is useful for somebody.
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
class Matrix {
public:
Matrix(int n):
side(n),
stride( round(n/8+1) ),
is0RowGood(true),
is0ColGood(true) {
printf("%d %d\n", side, stride);
// new throws std::bad_alloc on failure, so no null check is needed
data = new char[stride*side];
memset(data, 0, stride*side*sizeof(char) );
fill();
print();
}
~Matrix() {
delete[] data; // array form matches new char[]
}
void process() {
for( int j = 0; j < side; ++j ) {
if(getEl(0, j) == false) {
is0RowGood = false;
break;
}
}
for( int i = 0; i < side; ++i ) {
if(getEl(i, 0) == false) {
is0ColGood = false;
break;
}
}
for( int i = 1; i < side; ++i ) {
for( int j = 1; j < side; ++j ) {
if(!getEl(i,j)) {
setEl(i,0, false);
break;
}
}
}
for( int j = 1; j < side; ++j ) {
for( int i = 1; i < side; ++i ) {
// scan column j; a 0 anywhere in it marks the column bad in row 0
if(!getEl(i,j)) {
setEl(0, j, false);
break;
}
}
}
// nullify now
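for( int i = 1; i < side; ++i ) {
for( int j = 1; j < side; ++j ) {
// clear only this cell; crossing the whole row and column here
// would also wipe good lines and the flags that are still being read
if( !getEl(0,j) || !getEl(i,0) )
setEl(i, j, false);
}
}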
if(!is0RowGood)
crossRow(0);
if(!is0ColGood)
crossCol(0);
printf("--\n");
print();
}
private:
void crossRow(int x) {
for(int i = 0; i < side; ++i ) {
setEl(x, i, false);
}
}
void crossCol(int x) {
for(int i = 0; i < side; ++i ) {
setEl(i, x, false);
}
}
void print() {
for( int i = 0; i < side; ++i ) {
for( int j = 0; j < side; ++j ) {
printf(" %d ", getEl(i,j));
}
printf("\n");
}
}
void fill() {
for( int i = 0; i < side; ++i ) {
for( int j = 0; j < side; ++j ) {
usleep(15);
setEl(i, j, (rand() % 30 == 0) ? 0 : 1);
}
}
}
bool getEl(int i, int j) {
int offset = i/8 + j*stride;
// test bit i%8 of the byte with an integer shift instead of pow()
return (static_cast<unsigned char>(data[offset]) >> (i%8)) & 1;
}
void setEl(int i, int j, bool val) {
int offset = i/8 + j*stride;
unsigned char mask = static_cast<unsigned char>(1u << (i%8));
if(val)
data[offset] |= mask;
else
data[offset] &= static_cast<char>(~mask);
}
bool is0RowGood;
bool is0ColGood;
char* data;
int side;
int stride;
};
int
main( int argc,
const char** argv ) {
if(argc < 2) {
printf("give n as arg\n");
exit(1);
}
time_t t;
if(argc == 3)
t = atoi(argv[2]);
else {
t = time(NULL);
printf("t=%ld\n", (long)t);
}
srand (t);
int n = atoi( argv[1] );
printf("n=%d\n",n);
Matrix m(n);
m.process();
}

Fill table with known row's and column's sums

I know the values of sums in rows and columns in a matrix. The matrix is small (max 10x10) and values are in range from 0 to 99.
Is it possible to generate any matrix from this data? I am not interested in all possible combinations. Just one would be fine.
Ex.
Task
sum in columns 2 5 2
sum in rows
7 ? ? ?
0 ? ? ?
2 ? ? ?
Answer
2 4 1
0 0 0
0 1 1
I don't think it is possible, because there is more than one answer. For example,
0 5 2
0 0 0
2 0 0
yields the same row and column sums as the matrix you gave.
int n, m;
int rows[n], cols[m];
int answer[n][m];
while (true) {
boolean found = false;
int row = -1, col = -1;
for (int i = 0; i < n; i++)
for (int j = 0; j < m; j++)
if (rows[i] > 0 && cols[j] > 0 && (found == false || Math.min(rows[i], cols[j]) > Math.min(rows[row], cols[col]))) {
found = true;
row = i;
col = j;
}
if (!found)
break;
answer[row][col]++;
rows[row]--;
cols[col]--;
}
How it works: at each step we pick the row and column pair whose smaller remaining sum is the largest, and add 1 to the cell where they cross.
If an answer exists this code will find one:
int n, m;
int rows[n], cols[m];
int answer[n][m];
for (int i = 0; i < n; i++) {
int need = rows[i];
for (int j = 0; need > 0 && j < m; j++) {
int add = need;
if (add > cols[j])
add = cols[j];
if (add > 99)
add = 99;
answer[i][j] = add;
need -= add;
cols[j] -= add;
}
}
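A self-contained C sketch of the row-by-row greedy fill above might look like this. The function name fill_from_sums, the row-major layout of answer, and the success flag are assumptions added for illustration; the sums are assumed to be non-negative. Without the 99 cap this pass succeeds whenever the row and column totals match, but when the cap binds a single greedy pass can stop short even though a valid table exists, so the function reports failure instead of assuming success.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_CELL = 99 };   /* largest value a cell may hold, per the question */

/* Fills the n x m row-major table answer so that row i sums to rows[i] and
 * column j sums to the original cols[j]. The cols array is consumed.
 * Returns 1 if every sum was met, 0 otherwise. */
int fill_from_sums(const int *rows, int *cols, int *answer, size_t n, size_t m)
{
    assert(rows != NULL && cols != NULL && answer != NULL);

    for (size_t i = 0; i < n; i++) {
        int need = rows[i];
        for (size_t j = 0; j < m; j++) {
            int add = need;
            if (add > cols[j]) add = cols[j];     /* column capacity left */
            if (add > MAX_CELL) add = MAX_CELL;   /* per-cell limit */
            answer[i * m + j] = add;              /* also writes the zeros */
            need -= add;
            cols[j] -= add;
        }
        if (need != 0)
            return 0;                             /* row could not be completed */
    }
    for (size_t j = 0; j < m; j++)
        if (cols[j] != 0)
            return 0;                             /* column left unfilled */
    return 1;
}
```

For the example in the question (row sums 7, 0, 2 and column sums 2, 5, 2) the pass succeeds. For row sums 1, 198 with column sums 99, 100 it returns 0, although the table 0 1 / 99 99 satisfies the constraints, which illustrates the limitation of the greedy pass under the cap.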
