shell sort in openmp - sorting

Is anyone familiar with openmp, I don't get a sorted list. what am I doing wrong. I am using critical at the end so only one thread can access that section when it's been sorted. I guess my private values are not correct. Should they even be there or am I better off with just #pragma omp for.
void shellsort(int a[])
{
int i, j, k, m, temp;
omp_set_num_threads(10);
for(m = 2; m > 0; m = m/2)
{
#pragma omp parallel for private (j, m)
for(j = m; j < 100; j++)
{
#pragma omp critical
for(i = j-m; i >= 0; i = i-m)
{
if(a[i+m] >= a[i])
break;
else
{
temp = a[i];
a[i] = a[i+m];
a[i+m] = temp;
}
}
}
}
}

So there's a number of issues here.
So first, as has been pointed out, i and j (and temp) need to be private; m and a need to be shared. A useful thing to do with openmp is to use default(none), that way you are forced to think through what each variable you use in the parallel section does, and what it needs to be. So this
#pragma omp parallel for private (i,j,temp) shared(a,m) default(none)
is a good start. Making m private in particular is a bit of a disaster, because it means that m is undefined inside the parallel region. The loop, by the way, should start with m = n/2, not m=2.
In addition, you don't need the critical region -- or you shouldn't, for a shell sort. The issue, we'll see in a second, is not so much multiple threads working on the same elements. So if you get rid of those things, you end up with something that almost works, but not always. And that brings us to the more fundamental problem.
The way a shell sort works is, basically, you break the array up into many (here, m) subarrays, and insertion-sort them (very fast for small arrays), and then reassemble; then continue by breaking them up into fewer and fewer subarrays and insertion sort (very fast, because they're partly sorted). Sorting those many subarrays is somethign that can be done in parallel. (In practice, memory contention will be a problem with this simple approach, but still).
Now, the code you've got does that in serial, but it can't be counted on to work if you just wrap the j loop in an omp parallel for. The reason is that each iteration through the j loop does one step of one of the insertion sorts. The j+m'th loop iteration does the next step. But there's no guarantee that they're done by the same thread, or in order! If another thread has already done the j+m'th iteration before the first does the j'th, then the insertion sort is messed up and the sort fails.
So the way to make this work is to rewrite the shell sort to make the parallelism more explicit - to not break up the insertion sort into a bunch of serial steps.
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>
void insertionsort(int a[], int n, int stride) {
for (int j=stride; j<n; j+=stride) {
int key = a[j];
int i = j - stride;
while (i >= 0 && a[i] > key) {
a[i+stride] = a[i];
i-=stride;
}
a[i+stride] = key;
}
}
void shellsort(int a[], int n)
{
int i, m;
for(m = n/2; m > 0; m /= 2)
{
#pragma omp parallel for shared(a,m,n) private (i) default(none)
for(i = 0; i < m; i++)
insertionsort(&(a[i]), n-i, m);
}
}
void printlist(char *s, int a[], int n) {
printf("%s\n",s);
for (int i=0; i<n; i++) {
printf("%d ", a[i]);
}
printf("\n");
}
int checklist(int a[], int n) {
int result = 0;
for (int i=0; i<n; i++) {
if (a[i] != i) {
result++;
}
}
return result;
}
void seedprng() {
struct timeval t;
/* seed prng */
gettimeofday(&t, NULL);
srand((unsigned int)(1000000*(t.tv_sec)+t.tv_usec));
}
int main(int argc, char **argv) {
const int n=100;
int *data;
int missorted;
data = (int *)malloc(n*sizeof(int));
for (int i=0; i<n; i++)
data[i] = i;
seedprng();
/* shuffle */
for (int i=0; i<n; i++) {
int i1 = rand() % n;
int i2 = rand() % n;
int tmp = data[i1];
data[i1] = data[i2];
data[i2] = tmp;
}
printlist("Unsorted List:",data,n);
shellsort2(data,n);
printlist("Sorted List:",data,n);
missorted = checklist(data,n);
if (missorted != 0) printf("%d missorted nubmers\n",missorted);
return 0;
}

Variables "j" and "i" need to be declared private on the parallel region. As it is now, I am surprised anything is happening, because "m" can not be private. The critical region is allowing it to work for the "i" loop, but the critical region should be able to be reduced - though I haven't done a shell sort in a while.

Related

Finding number of occurrences of elements in array and printing them in in ascending order of elements

So, like the question says I want to arrange the occurrences of these elements in an ascending order of those elements. For example- if I input 7-3 times and 3-2 times, then output should be printed with 3-2 first and then (next line) 7-3. If you see the for loop with the comment to sort through the array, without that for loop the code works fine but doesn't print the elements in an ascending order. Let me know what you think about this and why that for loop isn't working?
#include<stdio.h>
int x;
int main()
{ int a[10000],b[10000],i,j,n,x,c=0,p,q ;
scanf("%d",&n);
for(i=0; i<n; i++)
{
scanf("%d",&a[i]);
}
for(i=0; i<n; i++)
{ c=1;
if(a[i]!=-1)
{ for(j=i+1; j<n; j++)
{
if(a[i]==a[j])
{ c++;
a[j]=-1;
}
}
b[i]=c;
}
}
for (i = 0; i < n; ++i) \\for loop to sort a[i] elements in ascending order
{ for (j = i + 1; j < n; ++j)
{
if (a[i] > a[j])
{
x = a[i];
a[i] = a[j];
a[j] = x;
}
}
}
for(i=0; i<n; i++)
{
if(a[i]!=-1 && b[i]>1)
{
printf("%d-%d\n",a[i],b[i]);
}
}
return 0;
}
You can do it either in O(n * lg n) e.g. using sorting or in expected linear time using std::map, I'm not sure if there is something like this in C.
Example impl. w/ sorting:
#include <iostream>
#include <vector>
#include <algorithm>
std::vector<int> v = {3,7,7,7,3,7,7};
std::sort(v.begin(), v.end());
for (int i = 0; i < v.size(); ++i) {
int number = v[i];
int count = 1;
while (v[i + count] == number) ++count;
i = i + count;
std::cout << number << " " << count << std::endl;
}
If you know that range of elements in the array is small enough you can use radix sort and so get it done in linear time.
About your implementation.
You are good with the first loop.
In the second loop, you need to take into account -1 entries. Also you need to swap not only a but b entries as well.
Check for b[i] equals to 1. You can initialize it to 0 before c=1; and drop b[i] > 1 check.
Few more comments not related to correctness. Do not use magic number -1, give it a name, and then use it. Do not declare all variables at the beginning of the function, declare every variable as close as possible to the first use.

Finding maximum difference b/w index in an array, with constraint a[i]<=a[j] where i<j

Here is my code, showing the wrong answer on a few test cases, can anyone tell me where it's failing.
I am not able to figure it out even after multiple attempts.
#include <iostream>
using namespace std;
int main() {
//code
int t,n;
cin >> t;
while(t--)
{
cin >> n;
long long int a[n],max=0;
for(int i=0;i<n;i++)
cin >> a[i];
int i=0,j=n-1;
while(i<=j)
{
if(a[j]>=a[i]){
max=j-i; break;}
else if(a[j-1]>=a[i] || a[j]>=a[i+1])
{ max=j-i-1; break;}
else
i++;
j--;
}
cout << max<<"\n";
}
return 0;
}
There is a solution in O(n log n):
Create a vector of index = 0 1 2 ... n-1
Sort (in a stable way) the indices i, j such that a[i] < a[i]
Determine the max_index values
max_index[i]= max (index[j], j >= i)
This can be calculated in a recursive way O(n)
For each index[i], determine index_max[i+1] - ind[i]); and determine the max of them
The maximum we obtained is the value we are looking for.
#include <iostream>
#include <vector>
#include <numeric>
#include <algorithm>
int diff_max (const std::vector<long long int> &a) {
int n = a.size();
std::vector<int> index(n), index_max(n);
int dmax = 0;
std::iota (index.begin(), index.end(), 0);
std::stable_sort (index.begin(), index.end(), [&a] (int i, int j) {return a[i] < a[j];});
index_max[n-1] = index[n-1];
for (int i = n-2; i >= 0; --i) {
index_max[i] = std::max (index_max[i+1], index[i]);
}
for (int i = 0; i < n-1; ++i) {
dmax = std::max (dmax, index_max[i+1] - index[i]);
}
return dmax;
}
int main() {
int t, n;
std::cin >> t;
while(t--) {
std::cin >> n;
std::vector<long long int> a(n);
for (int i = 0; i < n; ++i)
std::cin >> a[i];
auto max = diff_max (a);
std::cout << max << "\n";
}
return 0;
}
One known case where the algorithm fails:
1
5
5 7 6 2 3
The output, in this case, is 0, but it should be 2.
If the first two if conditions are not satisfied then you are incrementing i, here you are only comparing i with j and j-1, but there can be some other value of k such that k < j-1 and (i,j) is the answer.

Pset3 Sorting-sort. Program doesn't execute through all the code lines

I was rewriting the sort function in helper.c. The program compiled, but upon running, it froze. To simplify the problem, I've separated the sort function into another program with a certain set of ints to be sorted. Still, the execution froze at the sorting loop.
Can anyone help?
#include <cs50.h>
#include <stdio.h>
void sort(int values[], int n);
int main(void)
{
int randoms[] = {5,486,48,89,78,164,57,54,9};
sort(randoms, 9);
for(int i = 0; i < 9; i++)
{
printf("%i\n", randoms[i]);
}
}
void sort(int values[], int n)
{
// TODO: implement an O(n^2) sorting algorithm
for(int i = 0, counter; i < n - 1; i++)
{
counter = 0;
for(int j = 0, extra; j < n; i++)
{
//swap adjacent values in wrong order
if (values[j] > values[j+1])
{
extra = values[j];
values[j] = values[j+1];
values[j+1] = extra;
counter++;
}
}
if (counter == 0)
break;
}
return;
}
In your second for loop, you are increasing the wrong variable, it should be j++ instead of i++.

PGI Compiler Parallelization +=

I am working on getting a vector and matrix class parallelized and have run into an issue. Any time I have a loop in the form of
for (int i = 0; i < n; i++)
b[i] += a[i] ;
the code has a data dependency and will not parallelize. When working with the intel compiler it is smart enough to handle this without any pragmas (I would like to avoid the pragma for no dependency check just due to the vast number of loops similar to this and because the cases are actually more complicated than this and I would like it to check just in case one does exist).
Does anyone know of a compiler flag for the PGI compiler that would allow this?
Thank you,
Justin
edit: Error in the for loop. Wasn't copy pasting an actual loop
I think the problem is you're not using the restrict keyword in these routines, so the C compiler has to worry about pointer aliasing.
Compiling this program:
#include <stdlib.h>
#include <stdio.h>
void dbpa(double *b, double *a, const int n) {
for (int i = 0; i < n; i++) b[i] += a[i] ;
return;
}
void dbpa_restrict(double *restrict b, double *restrict a, const int n) {
for (int i = 0; i < n; i++) b[i] += a[i] ;
return;
}
int main(int argc, char **argv) {
const int n=10000;
double *a = malloc(n*sizeof(double));
double *b = malloc(n*sizeof(double));
for (int i=0; i<n; i++) {
a[i] = 1;
b[i] = 2;
}
dbpa(b, a, n);
double error = 0.;
for (int i=0; i<n; i++)
error += (3 - b[i]);
if (error < 0.1)
printf("Success\n");
dbpa_restrict(b, a, n);
error = 0.;
for (int i=0; i<n; i++)
error += (4 - b[i]);
if (error < 0.1)
printf("Success\n");
free(b);
free(a);
return 0;
}
with the PGI compiler:
$ pgcc -o tryautop tryautop.c -Mconcur -Mvect -Minfo
dbpa:
5, Loop not vectorized: data dependency
dbpa_restrict:
11, Parallel code generated with block distribution for inner loop if trip count is greater than or equal to 100
main:
21, Loop not vectorized: data dependency
28, Loop not parallelized: may not be beneficial
36, Loop not parallelized: may not be beneficial
gives us the information that the dbpa() routine without the restrict keyword wasn't parallelized, but the dbpa_restict() routine was.
Really, for this sort of stuff, though, you're better off just using OpenMP (or TBB or ABB or...) rather than trying to convince the compiler to autoparallelize for you; probably better still is just to use existing linear algebra packages, either dense or sparse, depending on what you're doing.

gcc openmp thread reuse

I am using gcc's implementation of openmp to try to parallelize a program. Basically the assignment is to add omp pragmas to obtain speedup on a program that finds amicable numbers.
The original serial program was given(shown below except for the 3 lines I added with comments at the end). We have to parallize first just the outer loop, then just the inner loop. The outer loop was easy and I get close to ideal speedup for a given number of processors. For the inner loop, I get much worse performance than the original serial program. Basically what I am trying to do is a reduction on the sum variable.
Looking at the cpu usage, I am only using ~30% per core. What could be causing this? Is the program continually making new threads everytime it hits the omp parallel for clause? Is there just so much more overhead in doing a barrier for the reduction? Or could it be memory access issue(eg cache thrashing)? From what I read with most implementations of openmp threads get reused overtime(eg pooled), so I am not so sure the first problem is what is wrong.
#include<stdio.h>
#include<stdlib.h>
#include<math.h>
#include <omp.h>
#define numThread 2
int main(int argc, char* argv[]) {
int ser[29], end, i, j, a, limit, als;
als = atoi(argv[1]);
limit = atoi(argv[2]);
for (i = 2; i < limit; i++) {
ser[0] = i;
for (a = 1; a <= als; a++) {
ser[a] = 1;
int prev = ser[a-1];
if ((prev > i) || (a == 1)) {
end = sqrt(prev);
int sum = 0;//added this
#pragma omp parallel for reduction(+:sum) num_threads(numThread)//added this
for (j = 2; j <= end; j++) {
if (prev % j == 0) {
sum += j;
sum += prev / j;
}
}
ser[a] = sum + 1;//added this
}
}
if (ser[als] == i) {
printf("%d", i);
for (j = 1; j < als; j++) {
printf(", %d", ser[j]);
}
printf("\n");
}
}
}
OpenMP thread teams are instantiated on entering the parallel section. This means, indeed, that the thread creation is repeated every time the inner loop is starting.
To enable reuse of threads, use a larger parallel section (to control the lifetime of the team) and specificly control the parallellism for the outer/inner loops, like so:
Execution time for test.exe 1 1000000 has gone down from 43s to 22s using this fix (and the number of threads reflects the numThreads defined value + 1
PS Perhaps stating the obvious, it would not appear that parallelizing the inner loop is a sound performance measure. But that is likely the whole point to this exercise, and I won't critique the question for that.
#include<stdio.h>
#include<stdlib.h>
#include<math.h>
#include <omp.h>
#define numThread 2
int main(int argc, char* argv[]) {
int ser[29], end, i, j, a, limit, als;
als = atoi(argv[1]);
limit = atoi(argv[2]);
#pragma omp parallel num_threads(numThread)
{
#pragma omp single
for (i = 2; i < limit; i++) {
ser[0] = i;
for (a = 1; a <= als; a++) {
ser[a] = 1;
int prev = ser[a-1];
if ((prev > i) || (a == 1)) {
end = sqrt(prev);
int sum = 0;//added this
#pragma omp parallel for reduction(+:sum) //added this
for (j = 2; j <= end; j++) {
if (prev % j == 0) {
sum += j;
sum += prev / j;
}
}
ser[a] = sum + 1;//added this
}
}
if (ser[als] == i) {
printf("%d", i);
for (j = 1; j < als; j++) {
printf(", %d", ser[j]);
}
printf("\n");
}
}
}
}

Resources