How to distribute teams on GPU using OpenMP? - parallel-processing

I'm trying to utilize my Nvidia GeForce GT 740M for parallel programming using OpenMP and the clang-3.8 compiler.
When processed in parallel on the CPU, I manage to get the desired result. However, when processed on the GPU, my results are essentially random numbers.
I therefore figured that I'm not correctly distributing my thread teams and that there might be some data races. I guess I have to structure my for-loops differently, but I have no idea where the mistake could be.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char* argv[])
{
    const int n = 100;
    float a = 3.0f;
    float b = 2.0f;
    float *x = (float *) malloc(n * sizeof(float));
    float *y = (float *) malloc(n * sizeof(float));
    int i;
    int j;
    int k;
    double start;
    double end;
    start = omp_get_wtime();

    for (k = 0; k < n; k++) {
        x[k] = 2.0f;
        y[k] = 3.0f;
    }

    #pragma omp target data map(to:x[0:n]) map(tofrom:y[0:n]) map(to:i) map(to:j)
    {
        #pragma omp target teams
        #pragma omp distribute
        for (i = 0; i < n; i++) {
            #pragma omp parallel for
            for (j = 0; j < n; j++) {
                y[j] = a*x[j] + y[j];
            }
        }
    }

    end = omp_get_wtime();
    printf("Work took %f seconds.\n", end - start);
    free(x);
    free(y);
    return 0;
}
I guess that it might have something to do with the architecture of my GPU, so I'm adding this:
I'm fairly new to the topic, so thanks for your help :)

Yes, there is a race here. Different teams are reading and writing to the same element of the array 'y'. Perhaps you want something like this?
for (i = 0; i < n; i++) {
    #pragma omp target teams distribute parallel for
    for (j = 0; j < n; j++) {
        y[j] = a*x[j] + y[j];
    }
}
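If it helps, here is a minimal, self-contained sketch of the whole program using that combined construct (untested on this particular GPU/toolchain, so treat it as an illustration rather than a verified fix). The map clauses for i and j are dropped because loop variables declared in the for statements are private on the device:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    const int n = 100;
    const float a = 3.0f;
    float *x = (float *) malloc(n * sizeof(float));
    float *y = (float *) malloc(n * sizeof(float));

    for (int k = 0; k < n; k++) {
        x[k] = 2.0f;
        y[k] = 3.0f;
    }

    double start = omp_get_wtime();

    /* Keep x and y resident on the device for the whole block. */
    #pragma omp target data map(to: x[0:n]) map(tofrom: y[0:n])
    {
        for (int i = 0; i < n; i++) {
            /* One offloaded SAXPY sweep per outer iteration; the combined
               construct splits the j loop across teams and their threads,
               so no two teams update the same y[j]. */
            #pragma omp target teams distribute parallel for
            for (int j = 0; j < n; j++) {
                y[j] = a * x[j] + y[j];
            }
        }
    }

    double end = omp_get_wtime();
    printf("Work took %f seconds.\n", end - start);
    printf("y[0] = %f (expected 603.0)\n", y[0]);

    free(x);
    free(y);
    return 0;
}

Each element should end up as 3 + 100*(3*2) = 603, which gives a quick correctness check on the device results.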

Related

Strange benchmark results for OpenMP methods

My benchmark results are very strange. On one hand I have the serial function for calculating the quadratic form; on the other hand I wrote two parallel versions. With one thread, all functions should need more or less the same running time, but one parallel function needs only half the time. Is there a "hidden" optimization?
Serial version:
double quadratic_form_serial(const std::vector<double> & A, const std::vector<double> & v, const std::vector<double> & w){
    int N = v.size();
    volatile double q = 0.0;

    for(int i=0; i<N; ++i)
        for(int j=0; j<N; ++j)
            q += v[i]*A[i*N+j]*w[j];

    return q;
}
Parallel version 1:
double quadratic_form_parallel(const std::vector<double> & A, const std::vector<double> & v, const std::vector<double> & w, const int threadnum){
    int N = v.size();
    omp_set_num_threads(threadnum);
    volatile double q[threadnum];
    volatile double val = 0.0;

    #pragma omp parallel
    {
        int me = omp_get_thread_num();
        q[me] = 0.0;

        #pragma omp for collapse(2)
        for(int i=0; i<N; ++i)
            for(int j=0; j<N; ++j)
                q[me] += v[i]*A[i*N+j]*w[j];

        #pragma omp atomic
        val += q[me];
    }
    return val;
}
Parallel version 2:
double quadratic_form_parallel2(const std::vector<double> & A, const std::vector<double> & v, const std::vector<double> & w, const int threadnum){
    int N = v.size();
    volatile double result = 0.0;
    omp_set_num_threads(threadnum);

    #pragma omp parallel for reduction(+: result)
    for (int i=0; i<N; ++i)
        for (int j=0; j<N; ++j)
            result += v[i] * A[i*N + j] * w[j];

    return result;
}
I ran the code for N=10000 and flushed the cache before calling each function. With one thread, quadratic_form_parallel2 needs less than half of the time the other two functions need:
threads    serial       Parallel1    Parallel2
1          0.0882503    0.0875649    0.0313441
Most likely this is the result of result being a reduction variable in the second OpenMP version. This means that each thread gets a private copy of result that is merged after the parallel region. This private copy probably does not respect the volatile restrictions and can therefore be optimized more aggressively. I assume the detailed interaction between volatile and private is unspecified.
This shows that marking a variable as volatile, presumably to keep the entire computation from being optimized away, is a bad idea. Instead, just output the result.
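A minimal sketch of that advice (my own illustration, not from the original post): drop the volatile qualifier, keep the reduction, and actually use the result, e.g. by printing it, so the compiler cannot discard the computation:

#include <cstdio>
#include <vector>
#include <omp.h>

// Quadratic form v^T A w with a plain OpenMP reduction; no volatile is needed
// because the result is actually used (printed) afterwards.
double quadratic_form(const std::vector<double>& A,
                      const std::vector<double>& v,
                      const std::vector<double>& w) {
    const int N = (int)v.size();
    double result = 0.0;
    #pragma omp parallel for reduction(+:result)
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            result += v[i] * A[i*N + j] * w[j];
    return result;
}

int main() {
    const int N = 1000;  // small example size, purely illustrative
    std::vector<double> A(N * N, 1.0), v(N, 1.0), w(N, 1.0);
    double t0 = omp_get_wtime();
    double q = quadratic_form(A, v, w);
    std::printf("q = %f in %f s\n", q, omp_get_wtime() - t0);  // q should be N*N
    return 0;
}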

Number of Pi in Parallel Programming (OpenMP)

Hello everyone, I wanted to calculate the number pi in OpenMP but something is wrong. Could you please tell me which part I did wrong?
As you can see below, the time is supposed to decrease as threads are added, but it doesn't.
#include <stdio.h>
#include <omp.h>
#define MAX_THREADS 4

static long num_steps = 100000000;
double step;

int main()
{
    int i, j;
    double pi, full_sum = 0.0;
    double start_time, run_time;
    double sum[MAX_THREADS];

    step = 1.0 / (double)num_steps;

    for (j = 1; j <= MAX_THREADS; j++) {
        omp_set_num_threads(j);
        full_sum = 0.0;
        start_time = omp_get_wtime();

        #pragma omp parallel private(i)
        {
            int id = omp_get_thread_num();
            int numthreads = omp_get_num_threads();
            double x;
            double partial_sum = 0;

            #pragma omp single
            printf(" num_threads = %d", numthreads);

            for (i = id; i < num_steps; i += numthreads) {
                x = (i + 0.5)*step;
                partial_sum += 4.0 / (1.0 + x*x);
            }

            #pragma omp critical
            full_sum += partial_sum;
        }

        pi = step * full_sum;
        run_time = omp_get_wtime() - start_time;
        printf("\n pi is %f in %f seconds %d threads \n ", pi, run_time, j);
    }
}
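For what it's worth, the manual pattern in the code above (a per-thread partial_sum merged into full_sum inside a critical section) is exactly what a reduction clause expresses more compactly; a minimal sketch of that variant (my illustration, not from the original post):

#include <stdio.h>
#include <omp.h>

static long num_steps = 100000000;

int main(void)
{
    const double step = 1.0 / (double)num_steps;
    double full_sum = 0.0;

    double start_time = omp_get_wtime();
    /* reduction(+:full_sum) gives each thread a private copy that is
       combined at the end, replacing the explicit critical section. */
    #pragma omp parallel for reduction(+:full_sum)
    for (long i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;
        full_sum += 4.0 / (1.0 + x * x);
    }
    double run_time = omp_get_wtime() - start_time;

    printf("pi is %f in %f seconds\n", step * full_sum, run_time);
    return 0;
}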

OpenMP: sum reduction -- incorrect result if not using +=

I have the following code:
#include <stdio.h>
#include <omp.h>
#define N 10

double x[N];

int main(void)
{
    double sum = 0.0;

    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < N; ++i)
            x[i] = (double) i;

        #pragma omp for reduction(+:sum)
        for (int i = 0; i < N; ++i)
        {
            sum += (double) x[i];
        }
    }

    printf("%le\n", sum);
    return 0;
}
The output is 45, which makes sense. However, if I replace sum += (double) x[i] with sum = (double) x[i], the output is then always 43. Could anyone explain why this happens?
Thanks!
If you replace sum += (double) x[i] by sum = (double) x[i], you are not accumulating the result; hence, sum will only store the value of the last iteration.
I think I know what's happening here. I thought sum was copied for each ITERATION, but actually sum is copied for each THREAD. I have 8 threads running, so one of them has lost its initial value. Solved. Thanks for all your answers here!
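To make that concrete, reduction(+:sum) behaves roughly like the hand-written equivalent below: every thread accumulates into its own private copy, initialized to the identity value 0, and the copies are merged into the shared sum at the end of the loop. Writing sum = x[i] overwrites that private copy instead of adding to it, which is why the result comes out wrong. A sketch for illustration:

#include <stdio.h>
#include <omp.h>
#define N 10

double x[N];

int main(void)
{
    double sum = 0.0;

    for (int i = 0; i < N; ++i)
        x[i] = (double) i;

    /* Hand-written equivalent of "#pragma omp for reduction(+:sum)":
       each thread accumulates into a private variable initialized to the
       identity (0 for +), and the private results are merged at the end. */
    #pragma omp parallel
    {
        double local_sum = 0.0;       /* the "private copy" of sum */
        #pragma omp for
        for (int i = 0; i < N; ++i)
            local_sum += x[i];
        #pragma omp atomic
        sum += local_sum;             /* merge into the shared sum */
    }

    printf("%le\n", sum);             /* prints 4.500000e+01, i.e. 45 */
    return 0;
}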

Fill histograms (array reduction) in parallel with OpenMP without using a critical section

I would like to fill histograms in parallel using OpenMP. I have come up with two different methods of doing this with OpenMP in C/C++.
The first method, proccess_data_v1, makes a private histogram variable hist_private for each thread, fills them in parallel, and then sums the private histograms into the shared histogram hist in a critical section.
The second method, proccess_data_v2, makes a shared array of histograms with size equal to the number of threads, fills this array in parallel, and then sums it into the shared histogram hist in parallel.
The second method seems superior to me since it avoids a critical section and sums the histograms in parallel. However, it requires knowing the number of threads and calling omp_get_thread_num(), which I generally try to avoid. Is there a better way to do the second method without referencing the thread numbers and without using a shared array with a size equal to the number of threads?
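(Aside: reconstruct_data and fill_hist are not defined in the post. If you want the functions below to compile stand-alone, hypothetical stand-ins along these lines would do; they are purely illustrative.)

// Hypothetical stand-ins for the helpers used below; the real ones are not
// shown in the question. reconstruct_data maps a raw sample to a value and
// fill_hist increments the bin that value falls into.
inline float reconstruct_data(float d) {
    return d;  // identity, purely for illustration
}

inline void fill_hist(int *hist, int nbins, float max, float x) {
    int bin = (int)(x / max * nbins);
    if (bin < 0) bin = 0;
    if (bin >= nbins) bin = nbins - 1;
    hist[bin]++;
}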
void proccess_data_v1(float *data, int *hist, const int n, const int nbins, float max) {
    #pragma omp parallel
    {
        int *hist_private = new int[nbins];
        for(int i=0; i<nbins; i++) hist_private[i] = 0;

        #pragma omp for nowait
        for(int i=0; i<n; i++) {
            float x = reconstruct_data(data[i]);
            fill_hist(hist_private, nbins, max, x);
        }

        #pragma omp critical
        {
            for(int i=0; i<nbins; i++) {
                hist[i] += hist_private[i];
            }
        }
        delete[] hist_private;
    }
}
void proccess_data_v2(float *data, int *hist, const int n, const int nbins, float max) {
    const int nthreads = 8;
    omp_set_num_threads(nthreads);
    int *hista = new int[nbins*nthreads];

    #pragma omp parallel
    {
        const int ithread = omp_get_thread_num();
        for(int i=0; i<nbins; i++) hista[nbins*ithread+i] = 0;

        #pragma omp for
        for(int i=0; i<n; i++) {
            float x = reconstruct_data(data[i]);
            fill_hist(&hista[nbins*ithread], nbins, max, x);
        }

        #pragma omp for
        for(int i=0; i<nbins; i++) {
            for(int t=0; t<nthreads; t++) {
                hist[i] += hista[nbins*t + i];
            }
        }
    }
    delete[] hista;
}
Based on a suggestion by @HristoIliev I have created an improved method called proccess_data_v3:
#define ROUND_DOWN(x, s) ((x) & ~((s)-1))
void proccess_data_v3(float *data, int *hist, const int n, const int nbins, float max) {
    int* hista;
    #pragma omp parallel
    {
        const int nthreads = omp_get_num_threads();
        const int ithread = omp_get_thread_num();

        int lda = ROUND_DOWN(nbins+1023, 1024); //1024 ints = 4096 bytes -> round to a multiple of page size
        #pragma omp single
        hista = (int*)_mm_malloc(lda*sizeof(int)*nthreads, 4096); //align memory to page size

        for(int i=0; i<nbins; i++) hista[lda*ithread+i] = 0;

        #pragma omp for
        for(int i=0; i<n; i++) {
            float x = reconstruct_data(data[i]);
            fill_hist(&hista[lda*ithread], nbins, max, x);
        }

        #pragma omp for
        for(int i=0; i<nbins; i++) {
            for(int t=0; t<nthreads; t++) {
                hist[i] += hista[lda*t + i];
            }
        }
    }
    _mm_free(hista);
}
You could allocate the big array inside the parallel region, where you can query the actual number of threads being used:
int *hista;
#pragma omp parallel
{
    const int nthreads = omp_get_num_threads();
    const int ithread = omp_get_thread_num();

    #pragma omp single
    hista = new int[nbins*nthreads];

    ...
}
delete[] hista;
For better performance I would advise that you round the size of each thread's chunk in hista to a multiple of the system's memory page size, even if this could potentially leave holes between the different partial histograms. This way you will prevent both false sharing and remote memory access on NUMA systems (but not in the final reduction phase).
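To make the rounding concrete, here is a tiny illustration (with a made-up nbins, and assuming 4096-byte pages and 4-byte int, as in the code above):

#include <cstdio>

#define ROUND_DOWN(x, s) ((x) & ~((s)-1))

int main() {
    const int nbins = 600;  // example bin count, purely illustrative
    // 4096-byte page / sizeof(int) = 1024 ints per page, so round the
    // per-thread stride up to the next multiple of 1024 ints.
    const int lda = ROUND_DOWN(nbins + 1023, 1024);
    std::printf("nbins = %d -> padded stride lda = %d ints (%zu bytes)\n",
                nbins, lda, lda * sizeof(int));
    // Thread t then owns hista[lda*t .. lda*t + nbins - 1]; the padding keeps
    // each thread's bins on its own page(s), avoiding false sharing.
    return 0;
}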

gcc openmp thread reuse

I am using GCC's implementation of OpenMP to try to parallelize a program. Basically the assignment is to add omp pragmas to obtain speedup on a program that finds amicable numbers.
The original serial program was given (shown below, except for the three lines I added, marked with comments). We have to parallelize first just the outer loop, then just the inner loop. The outer loop was easy and I get close to ideal speedup for a given number of processors. For the inner loop, I get much worse performance than the original serial program. Basically what I am trying to do is a reduction on the sum variable.
Looking at the CPU usage, I am only using ~30% per core. What could be causing this? Is the program continually making new threads every time it hits the omp parallel for clause? Is there just so much more overhead in doing a barrier for the reduction? Or could it be a memory access issue (e.g. cache thrashing)? From what I have read, most OpenMP implementations reuse threads over time (e.g. a thread pool), so I am not so sure the first problem is what is wrong.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
#define numThread 2

int main(int argc, char* argv[]) {
    int ser[29], end, i, j, a, limit, als;
    als = atoi(argv[1]);
    limit = atoi(argv[2]);
    for (i = 2; i < limit; i++) {
        ser[0] = i;
        for (a = 1; a <= als; a++) {
            ser[a] = 1;
            int prev = ser[a-1];
            if ((prev > i) || (a == 1)) {
                end = sqrt(prev);
                int sum = 0; //added this
                #pragma omp parallel for reduction(+:sum) num_threads(numThread) //added this
                for (j = 2; j <= end; j++) {
                    if (prev % j == 0) {
                        sum += j;
                        sum += prev / j;
                    }
                }
                ser[a] = sum + 1; //added this
            }
        }
        if (ser[als] == i) {
            printf("%d", i);
            for (j = 1; j < als; j++) {
                printf(", %d", ser[j]);
            }
            printf("\n");
        }
    }
}
OpenMP thread teams are instantiated on entering the parallel section. This means, indeed, that the thread creation is repeated every time the inner loop starts.
To enable reuse of threads, use a larger parallel section (to control the lifetime of the team) and specifically control the parallelism of the outer/inner loops, like so:
Execution time for test.exe 1 1000000 has gone down from 43s to 22s using this fix (and the number of threads reflects the numThread defined value + 1).
PS: Perhaps stating the obvious, parallelizing the inner loop does not appear to be a sound performance measure. But that is likely the whole point of this exercise, and I won't critique the question for that.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
#define numThread 2

int main(int argc, char* argv[]) {
    int ser[29], end, i, j, a, limit, als;
    als = atoi(argv[1]);
    limit = atoi(argv[2]);
    #pragma omp parallel num_threads(numThread)
    {
        #pragma omp single
        for (i = 2; i < limit; i++) {
            ser[0] = i;
            for (a = 1; a <= als; a++) {
                ser[a] = 1;
                int prev = ser[a-1];
                if ((prev > i) || (a == 1)) {
                    end = sqrt(prev);
                    int sum = 0; //added this
                    #pragma omp parallel for reduction(+:sum) //added this
                    for (j = 2; j <= end; j++) {
                        if (prev % j == 0) {
                            sum += j;
                            sum += prev / j;
                        }
                    }
                    ser[a] = sum + 1; //added this
                }
            }
            if (ser[als] == i) {
                printf("%d", i);
                for (j = 1; j < als; j++) {
                    printf(", %d", ser[j]);
                }
                printf("\n");
            }
        }
    }
}
