Hybrid Parallel Programming using OpenMPI and OpenMP: OpenMP threads always use 1 out of 4 cores per node

I have created a home cluster using 3 Raspberry PI boards to learn and experiment with Hybrid Parallel programming.
My setup is as follows:
1 RPi3 B+ (4 cores) and 2 RPi4 boards (4 cores each), all connected over Ethernet through a switch, so in total I have 12 cores in my cluster. The RPi3 B+ is designated as node1 and the two RPi4 boards as node2 and node3 respectively.
I have written a simple program using OpenMPI and OpenMP, which sums up two large vectors.
The structure of the code is as follows:
My intention here is to start 3 MPI processes with ranks 0, 1, and 2 and bind them to node1, node2, and node3 respectively. Inside each MPI process I start 4 OpenMP threads, which do their job of summing up the distributed vector chunks and return the result to rank 0.
Although the code works and gives the correct result, my observation is that the OpenMP threads are not leveraging all the CPU cores present in a node. Using htop, I have noticed that only 1 core on each of the nodes is 100% loaded.
The way I am launching the program is as follows:
#mpirun --map-by node --display-devel-map --hostfile nodeconfig -np 3 ./addVectorsMPIHybrid 3000000 3000000
The nodeconfig file contains just the names of the nodes, i.e. node1, node2 and node3.
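For completeness, here is a variation I intend to try, in case Open MPI's default binding pins each rank (and therefore all of its OpenMP threads) to a single core -- I have not confirmed that this is the cause. The hostfile would declare the slots explicitly and the launch line would relax the binding:
node1 slots=4
node2 slots=4
node3 slots=4
#mpirun --hostfile nodeconfig --map-by node --bind-to none -np 3 ./addVectorsMPIHybrid 3000000 3000000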
Please note: I have verified that 4 OpenMP threads are getting created in the #pragma omp parallel section.
Can someone please give me pointers on why the OpenMP threads are not running on all the cores?
The code that I am running:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <assert.h>
#include <omp.h>
int main(int argc, char *argv[])
{
double *v1, *v2, *resulv;
double *v1local, *v2local, *resulvlocal;
double startTime, endTime;
int rank, size;
int local_n;
int n;
int root = 0;
int provided;
if(argc <2)
{
printf("Please provide correct inputs\n");
exit(0);
}
n = atoi(argv[1]);
if(n <=0)
{
printf("Please enter valid size of vector\n");
exit(0);
}
//MPI_Init(NULL, NULL);
MPI_Init_thread(NULL, NULL, MPI_THREAD_FUNNELED, &provided);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
local_n = n/size;
if(rank == 0)
{
v1 = (double *) malloc (sizeof(double) * n);
assert(v1);
v2 = (double *) malloc (sizeof(double) * n);
assert(v2);
resulv = (double *) malloc (sizeof(double) * n);
assert(resulv);
for(int i=0; i<n; i++) /* zero-initialise the result vector */
resulv[i] = 0.0;
for(int i=0; i<n; i++)
v1[i] = v2[i] = (double)i;
}
v1local = (double *) malloc (sizeof(double) * local_n);
assert(v1local);
v2local = (double *) malloc (sizeof(double) * local_n);
assert(v2local);
resulvlocal = (double *) malloc (sizeof(double) * local_n);
assert(resulvlocal);
for(int i=0; i<local_n; i++) /* zero-initialise the local result chunk */
resulvlocal[i] = 0.0;
startTime = MPI_Wtime();
//Scatter v1
MPI_Scatter(v1, local_n,MPI_DOUBLE, v1local, local_n, MPI_DOUBLE, root, MPI_COMM_WORLD);
MPI_Scatter(v2, local_n,MPI_DOUBLE, v2local, local_n, MPI_DOUBLE, root, MPI_COMM_WORLD);
#pragma omp parallel num_threads(4) default(none) shared(resulvlocal, v1local, v2local) firstprivate(local_n)
{
#pragma omp single
{
printf("num_threads = %d\n", omp_get_num_threads());
}
#pragma omp for
for(int i=0; i<local_n; i++)
resulvlocal[i] = v1local[i] + v2local[i];
}
MPI_Gather(resulvlocal, local_n, MPI_DOUBLE, resulv, local_n, MPI_DOUBLE, root, MPI_COMM_WORLD);
endTime = MPI_Wtime();
if(rank == 0)
{
//Display resulv
#if 0
for(int i=0; i<n; i++)
printf("%0.3f ",resulv[i]);
printf("\n");
#endif
printf("Time taken to compute is:%f in sec\n",endTime-startTime);
free(v1);
free(v2);
free(resulv);
}
free(v1local);
free(v2local);
free(resulvlocal);
MPI_Finalize();
return 0;
}
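To see where the threads actually land, I also put together this small standalone check (sched_getcpu() is glibc-specific; compile with -fopenmp). If all four threads report the same core number, the whole process is pinned to a single core:
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h> /* sched_getcpu(), glibc-specific */
#include <omp.h>
int main(void)
{
    #pragma omp parallel num_threads(4)
    {
        /* each thread reports the core it is currently running on */
        printf("thread %d on core %d\n", omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}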

Related

MPI Program using MPI_Scatter and MPI_Reduce

Write an MPI program that efficiently compute the sum of array elements.
Program 1: Tasks communicate with MPI_Scatter and MPI_Reduce.
The programs can assume that the number of processes is a power of two.
The programs should add 2^15 = 65536 random doubles in the range 0 to 100.
Task 0 must generate the numbers, store them in array and distribute them to the tasks.
Each task does a serial sum of the numbers it is assigned. The local sums are then
added together using a tree structured parallel sum.
After the parallel sum is complete, task 0 should compute a serial sum of the
same numbers (to verify the result).
Task 0 must print the parallel sum, the serial sum and the time required for the
parallel sum (including data distribution).
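For reference, here is a minimal sketch of the tree-structured sum the assignment describes, assuming the process count is a power of two (the helper name tree_sum is mine; note that MPI_Reduce with MPI_SUM, as used in the code below, performs an equivalent reduction internally):
#include <mpi.h>
/* Tree-structured sum of one double per rank; assumes comm_size is a power of two.
   At each step, ranks that are odd multiples of the step size send their partial sum
   one step down and drop out; rank 0 ends up holding the total. */
double tree_sum(double local, int rank, int comm_size)
{
    for (int step = 1; step < comm_size; step *= 2) {
        if (rank % (2 * step) != 0) {
            MPI_Send(&local, 1, MPI_DOUBLE, rank - step, 0, MPI_COMM_WORLD);
            break; /* this rank is done */
        } else if (rank + step < comm_size) {
            double partner;
            MPI_Recv(&partner, 1, MPI_DOUBLE, rank + step, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            local += partner;
        }
    }
    return local; /* the full sum is only valid on rank 0 */
}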
#include <stdio.h>
#include <mpi.h>
int main(int argc,char *argv[]){
MPI_Init(NULL,NULL); // Initialize the MPI environment
int rank;
int comm_size;
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
MPI_Comm_size(MPI_COMM_WORLD,&comm_size);
int number1[2];
int number[4];
if(rank == 0){
number[0]=1;
number[1]=2;
number[2]=3;
number[3]=4;
//number[4]=5;
}
double local_start, local_finish, local_elapsed, elapsed;
MPI_Barrier(MPI_COMM_WORLD);
local_start = MPI_Wtime();
//All processes
MPI_Scatter(number, 2, MPI_INT, number1, 2, MPI_INT, 0, MPI_COMM_WORLD);
//printf("I'm process %d , I received the array : ",rank);
int sub_sum = 0;
for(int i=0 ; i<2 ; i++){
// printf("%d ",number1[i]);
sub_sum = sub_sum + number1[i];
}
printf("\n");
int sum = 0;
MPI_Reduce(&sub_sum, &sum, 1, MPI_INT, MPI_SUM,0,MPI_COMM_WORLD);
local_finish = MPI_Wtime();
local_elapsed = local_finish -local_start;
MPI_Reduce(&local_elapsed,&elapsed,1,MPI_DOUBLE,MPI_MAX,0,MPI_COMM_WORLD);
if(rank == 0)
{
printf("\nthe sum of array is: %d\n",sum);
printf("Elapsed time = %e seconds\n",elapsed);
}
MPI_Finalize();
return 0;
}

MPI - scattering filepaths to processes

I have 4 filepaths in the global_filetable and I am trying to scatter 2 filepaths to each process.
Process 0 has the proper 2 paths, but there is something strange in process 1 (null)...
EDIT:
Here's the full code:
#include <stdio.h>
#include <stdlib.h> // malloc
#include <string.h> // strncpy
#include <limits.h> // PATH_MAX
#include <mpi.h>
int main(int argc, char *argv[])
{
char** global_filetable = (char**)malloc(4 * PATH_MAX * sizeof(char));
for(int i = 0; i < 4; ++i) {
global_filetable[i] = (char*)malloc(PATH_MAX *sizeof(char));
strncpy(global_filetable[i], "/path/", PATH_MAX);
}
/*for(int i = 0; i < 4; ++i) {
printf("%s\n", global_filetable[i]);
}*/
int rank, size;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
char** local_filetable = (char**)malloc(2 * PATH_MAX * sizeof(char));
MPI_Scatter(global_filetable, 2*PATH_MAX, MPI_CHAR, local_filetable, 2*PATH_MAX , MPI_CHAR, 0, MPI_COMM_WORLD);
{
/* now all processors print their local data: */
for (int p = 0; p < size; ++p) {
if (rank == p) {
printf("Local process on rank %d is:\n", rank);
for (int i = 0; i < 2; i++) {
printf("path: %s\n", local_filetable[i]);
}
}
MPI_Barrier(MPI_COMM_WORLD);
}
}
MPI_Finalize();
return 0;
}
Output:
Local process on rank 0 is:
path: /path/
path: /path/
Local process on rank 1 is:
path: (null)
path: (null)
Do you have any idea why I am getting those nulls?
First, your allocation is inconsistent:
char** local_filetable = (char**)malloc(2 * PATH_MAX * sizeof(char));
The type char** indicates an array of char*, but you allocate a contiguous memory block, which would indicate a char*.
The easiest way would be to use the contiguous memory as char* for both global and local filetables. Depending on what get_filetable() actually does, you may have to convert. You can then index it like this:
char* entry = &filetable[i * PATH_MAX];
You can then simply scatter like this:
MPI_Scatter(global_filetable, 2 * PATH_MAX, MPI_CHAR,
local_filetable, 2 * PATH_MAX, MPI_CHAR, 0, MPI_COMM_WORLD);
Note that there is no displacement any more; every rank just gets an equal-sized chunk of the contiguous memory.
The next step would be to define a C and MPI struct encapsulating PATH_MAX characters so you can get rid of the constant usage of PATH_MAX and crude indexing.
I think this is much nicer (less complex, less memory management) than using actual char**. You would only need that if memory waste or redundant data transfer becomes an issue.
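A rough sketch of that struct-plus-datatype idea (the names path_entry and make_path_type are mine, not from the question) could look like this:
#include <limits.h> /* PATH_MAX */
#include <mpi.h>
typedef struct {
    char path[PATH_MAX];
} path_entry;
/* build an MPI datatype that covers exactly one path_entry */
MPI_Datatype make_path_type(void)
{
    MPI_Datatype path_type;
    MPI_Type_contiguous(PATH_MAX, MPI_CHAR, &path_type);
    MPI_Type_commit(&path_type);
    return path_type;
}
/* usage, two entries per rank:
   MPI_Scatter(global_filetable, 2, path_type,
               local_filetable,  2, path_type, 0, MPI_COMM_WORLD); */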
P.S. Make sure never to put more than PATH_MAX - 1 characters in a filetable entry, to keep space for the trailing \0.
Okay, I'm stupid.
char global_filetable[NUMBER_OF_STRINGS][PATH_MAX];
for(int i = 0; i < 4; ++i) {
strcpy(global_filetable[i], "/path/");
}
char local_filetable[2][PATH_MAX];
Now it works!

Why does this OpenMP code work on dynamic scheduling but not on static?

I'm learning OpenMP by building a simple program to calculate pi using the following algorithm:
pi = 4/1 - 4/3 + 4/5 - 4/7 + 4/9...
The problem is that it does not work correctly when I change the scheduling to static. It works perfectly when the thread count is one. It also runs correctly under dynamic scheduling despite the result differing slightly every time it's run. Any idea what could be the problem?
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define N 100
#define CSIZE 1
#define nthread 2
int pi()
{
int i, chunk;
float pi = 0, x = 1;
chunk = CSIZE;
omp_set_num_threads(nthread);
#pragma omp parallel shared(i, x, chunk)
{
if (omp_get_num_threads() == 0)
{
printf("Number of threads = %d\n", omp_get_num_threads());
}
printf("Thread %d starting...\n", omp_get_thread_num());
#pragma omp for schedule(dynamic, chunk)
for (i = 1; i <= N; i++)
{
if (i % 2 == 0)
pi = pi - 4/x;
else
pi = pi + 4/x;
x = x + 2;
printf("Pi is currently %f at iteration %d with x = %0.0f on thread %d\n",
pi, i, x, omp_get_thread_num());
}
}
return EXIT_SUCCESS;
}
When I test your code with the printf in the loop, dynamic scheduling does all the work on the first thread and none on the second (making the program effectively serial). If you remove the printf statement you will find that the value of pi is random. This is because you have race conditions on x and pi.
Instead of using x you can divide by 2*i+1 (for i starting at zero). Also instead of using a branch to get the sign you can use sign = -2*(i%2)+1. To get pi you need to do a reduction using #pragma omp for schedule(static) reduction(+:pi).
#include <stdio.h>
#define N 10000
int main() {
float pi;
int i;
pi = 0;
#pragma omp parallel for schedule(static) reduction(+:pi)
for (i = 0; i < N; i++) {
pi += (-2.0f*(i&1)+1)/(2*i+1);
}
pi*=4.0f;
printf("%f\n", pi);
}
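For what it's worth, this should compile with something like gcc -O2 -fopenmp pi.c, and the thread count can then be varied via OMP_NUM_THREADS; the reduction(+:pi) clause gives each thread its own private partial sum, which is why the result no longer depends on the schedule.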

MPI Scatter of a dynamically allocated 2D array (pgm image file)

I have implemented a 2D array MPI scatter which works well. I mean that the master process can scatter 2D parts of the initial big array. The problem is that when I use the dynamically allocated 2D image file as input, it doesn't work. I suppose that there must be something wrong with the memory. Is there any way of obtaining 2D parts of a big, dynamically allocated 2D array?
I had a similar problem, but with a dynamically allocated one-dimensional vector.
I solved my problem as follows:
#include <stdio.h>
#include <stdlib.h> /* malloc, exit */
#include "mpi.h"
int main(int argc, char** argv) {
/* .......Variables Initialisation ......*/
int Numprocs, MyRank, Root = 0;
int index;
int *InputBuffer, *RecvBuffer;
int Scatter_DataSize;
int DataSize;
MPI_Status status;
/* ........MPI Initialisation .......*/
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &MyRank);
MPI_Comm_size(MPI_COMM_WORLD, &Numprocs);
if (MyRank == Root) {
DataSize = 80000;
/* ...Allocate memory.....*/
InputBuffer = (int*) malloc(DataSize * sizeof(int));
for (index = 0; index < DataSize; index++)
InputBuffer[index] = index;
}
MPI_Bcast(&DataSize, 1, MPI_INT, Root, MPI_COMM_WORLD);
if (DataSize % Numprocs != 0) {
if (MyRank == Root)
printf("Input is not evenly divisible by Number of Processes\n");
MPI_Finalize();
exit(-1);
}
Scatter_DataSize = DataSize / Numprocs;
RecvBuffer = (int *) malloc(Scatter_DataSize * sizeof(int));
MPI_Scatter(InputBuffer, Scatter_DataSize, MPI_INT, RecvBuffer,
Scatter_DataSize, MPI_INT, Root, MPI_COMM_WORLD);
for (index = 0; index < Scatter_DataSize; ++index)
printf("MyRank = %d, RecvBuffer[%d] = %d \n", MyRank, index,
RecvBuffer[index]);
MPI_Finalize();
}
This link has examples that have helped me:
http://www.cse.iitd.ernet.in/~dheerajb/MPI/Document/hos_cont.html
Hope this helps.
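For the 2D case in the question, my guess (without seeing the allocation code) is that the image is allocated as an array of row pointers, which is not contiguous in memory, so MPI_Scatter cannot treat it as a single buffer. A sketch with one contiguous allocation and a row-block scatter (the image size and element type here are assumptions) would be:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
int main(int argc, char **argv)
{
    int rank, nprocs;
    int rows = 512, cols = 512; /* assumed image size, rows divisible by nprocs */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    unsigned char *image = NULL;
    if (rank == 0) {
        /* one contiguous block, NOT an array of row pointers */
        image = malloc((size_t)rows * cols);
        for (int i = 0; i < rows * cols; i++)
            image[i] = (unsigned char)(i % 256);
    }
    int local_rows = rows / nprocs;
    unsigned char *local = malloc((size_t)local_rows * cols);
    /* each rank receives a contiguous block of whole rows */
    MPI_Scatter(image, local_rows * cols, MPI_UNSIGNED_CHAR,
                local, local_rows * cols, MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);
    /* row i of the local block starts at &local[i * cols] */
    free(local);
    if (rank == 0) free(image);
    MPI_Finalize();
    return 0;
}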

Low performance in an OpenMP program

I am trying to understand an OpenMP code from here. You can see the code below.
In order to measure the speedup, i.e. the difference between the serial and OpenMP versions, I use gettimeofday() from sys/time.h; is this approach right?
The program runs on a 4-core machine. I specify export OMP_NUM_THREADS="4" but cannot see a substantial speedup; usually I get 1.2 - 1.7. What problems am I facing in this parallelization?
Which debugging/performance tool could I use to see where the performance is lost?
code (for compilation I use xlc_r -qsmp=omp omp_workshare1.c -o omp_workshare1.exe)
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#define CHUNKSIZE 1000000
#define N 100000000
int main (int argc, char *argv[])
{
int nthreads, tid, i, chunk;
float a[N], b[N], c[N];
unsigned long elapsed;
unsigned long elapsed_serial;
unsigned long elapsed_omp;
struct timeval start;
struct timeval stop;
chunk = CHUNKSIZE;
// ================= SERIAL start =======================
/* Some initializations */
for (i=0; i < N; i++)
a[i] = b[i] = i * 1.0;
gettimeofday(&start,NULL);
for (i=0; i<N; i++)
{
c[i] = a[i] + b[i];
//printf("Thread %d: c[%d]= %f\n",tid,i,c[i]);
}
gettimeofday(&stop,NULL);
elapsed = 1000000 * (stop.tv_sec - start.tv_sec);
elapsed += stop.tv_usec - start.tv_usec;
elapsed_serial = elapsed ;
printf (" \n Time SEQ= %lu microsecs\n", elapsed_serial);
// ================= SERIAL end =======================
// ================= OMP start =======================
/* Some initializations */
for (i=0; i < N; i++)
a[i] = b[i] = i * 1.0;
gettimeofday(&start,NULL);
#pragma omp parallel shared(a,b,c,nthreads,chunk) private(i,tid)
{
tid = omp_get_thread_num();
if (tid == 0)
{
nthreads = omp_get_num_threads();
printf("Number of threads = %d\n", nthreads);
}
//printf("Thread %d starting...\n",tid);
#pragma omp for schedule(static,chunk)
for (i=0; i<N; i++)
{
c[i] = a[i] + b[i];
//printf("Thread %d: c[%d]= %f\n",tid,i,c[i]);
}
} /* end of parallel section */
gettimeofday(&stop,NULL);
elapsed = 1000000 * (stop.tv_sec - start.tv_sec);
elapsed += stop.tv_usec - start.tv_usec;
elapsed_omp = elapsed ;
printf (" \n Time OMP= %lu microsecs\n", elapsed_omp);
// ================= OMP end =======================
printf (" \n speedup= %f \n\n", ((float) elapsed_serial) / ((float) elapsed_omp)) ;
}
There's nothing really wrong with the code as above, but your speedup is going to be limited by the fact that the main loop, c=a+b, has very little work -- the time required to do the computation (a single addition) is going to be dominated by memory access time (2 loads and one store), and there's more contention for memory bandwidth with more threads acting on the array.
We can test this by making the work inside the loop more compute-intensive:
c[i] = exp(sin(a[i])) + exp(cos(b[i]));
And then we get
$ ./apb
Time SEQ= 17678571 microsecs
Number of threads = 4
Time OMP= 4703485 microsecs
speedup= 3.758611
which is obviously a lot closer to the 4x speedup one would expect.
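(Note that the exp/sin/cos version also needs #include <math.h> and, at least with gcc, linking with -lm.)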
Update: Oh, and to the other questions -- gettimeofday() is probably fine for timing, and on a system where you're using xlc - is this AIX? In that case, peekperf is a good overall performance tool, and the hardware performance monitors will give you access to memory access times. On x86 platforms, free tools for performance monitoring of threaded code include cachegrind/valgrind for cache performance debugging (not the problem here), scalasca for general OpenMP issues, and OpenSpeedShop is pretty useful, too.
