MPI communication delay - does the size of the message matter?

MPI communication delay - does the size of the message matter? - time

I've been working with time measuring (benchmarking) in parallel algorithms, more specific, matrix multiplication. I'm using the following algorithm:
if(taskid==MASTER) {
averow = NRA/numworkers;
extra = NRA%numworkers;
offset = 0;
mtype = FROM_MASTER;
for (dest=1; dest<=numworkers; dest++)
{
rows = (dest <= extra) ? averow+1 : averow;
MPI_Send(&offset, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
MPI_Send(&rows, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
MPI_Send(&a[offset][0], rows*NCA, MPI_DOUBLE, dest, mtype,MPI_COMM_WORLD);
MPI_Send(&b, NCA*NCB, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
offset = offset + rows;
}
mtype = FROM_WORKER;
for (i=1; i<=numworkers; i++)
{
source = i;
MPI_Recv(&offset, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
MPI_Recv(&rows, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
MPI_Recv(&c[offset][0], rows*NCB, MPI_DOUBLE, source, mtype,
MPI_COMM_WORLD, &status);
printf("Resultados recebidos do processo %d\n",source);
}
}
else {
mtype = FROM_MASTER;
MPI_Recv(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
MPI_Recv(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
MPI_Recv(&a, rows*NCA, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);
MPI_Recv(&b, NCA*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);
for (k=0; k<NCB; k++)
for (i=0; i<rows; i++)
{
c[i][k] = 0.0;
for (j=0; j<NCA; j++)
c[i][k] = c[i][k] + a[i][j] * b[j][k];
}
mtype = FROM_WORKER;
MPI_Send(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
MPI_Send(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
MPI_Send(&c, rows*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD);
}
I noticed that, for square matrices it took less time than for rectangular ones.
For example: if I use 4 nodes (one as master) and A is 500x500 and B is 500x500, the number of iterations per node equals 41.5 million, while if A is 2400000x6 and B is 6x6, it iterates 28.8 million times per node. Although the second case takes less iterations, it took about 1.00 second, while the first took only about 0.46s.
Logically, the second should be faster, considering it has less iterations per node.
Doing some math, I realized that the MPI sends and receives 83,000 elements per message on the first case, and 4,800,000 elements on the second.
Does the size of the message justify the delay?

The size of messages sent over MPI will definitely affect the performance
of your code. Take a look at THESE graphs posted in one of the popular MPI
implementation's webpage.
As you can see in the first graph, the latency of communication increases
with message size. This trend is applicable to any network and not just InfiniBand
as indicated in this graph.

Related

The relation between MPI_Comm_split and MPI_Send / MPI_Recv

Let's say there are 3 processes, This code works fine:
#include <iostream>
#include <mpi.h>
using namespace std;
int main(){
MPI_Init(NULL, NULL);
int rank; MPI_Comm SubWorld; int buf;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0){
MPI_Comm_split(MPI_COMM_WORLD, rank, rank, &SubWorld);
MPI_Send(&buf, 1, MPI_INT, 1, 55, MPI_COMM_WORLD);
}
else if (rank == 1){
MPI_Comm_split(MPI_COMM_WORLD, rank, rank, &SubWorld);
MPI_Recv(&buf, 1, MPI_INT, 0, 55, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
else MPI_Comm_split(MPI_COMM_WORLD, rank, rank, &SubWorld);
cout << "Done" << endl;
MPI_Finalize();
return 0;
}
Outputs "Done" three times as expected.
But this code has a problem (also 3 processes):
#include <iostream>
#include <mpi.h>
using namespace std;
int main(){
MPI_Init(NULL, NULL);
int rank; MPI_Comm SubWorld; int buf;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0){
MPI_Comm_split(MPI_COMM_WORLD, rank, rank, &SubWorld);
MPI_Send(&buf, 1, MPI_INT, 1, 55, MPI_COMM_WORLD);
}
else if (rank == 1){
MPI_Recv(&buf, 1, MPI_INT, 0, 55, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Comm_split(MPI_COMM_WORLD, rank, rank, &SubWorld);
}
else MPI_Comm_split(MPI_COMM_WORLD, rank, rank, &SubWorld);
cout << "Done" << endl;
MPI_Finalize();
return 0;
}
There is no output!!
What exactly is the relation between MPI_Comm_split and MPI_Send / MPI_Recv which cause this problem?

MPI_Comm_split() is a collective operation, which means all MPI tasks from the initial communicator (e.g. MPI_COMM_WORLD here) must invoke it at the same time.
In your example, rank 1 hangs in MPI_Recv(), and hence MPI_Comm_split() on rank 0 cannot complete (and MPI_Send() is never invoked) and hence the deadlock.
You might consider padb in order to visualize the state of a MPI program, it would make it easy to see where stacks are stuck at.

mpi parallel program to find prime numbers. Please help me dubug

I wrote the following program to find prime number with the #defined value. It is parallel program using mpi. Can anyone help me find a error in it. It compile well but crashes while executing.
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define N 65
int rank, size;
double start_time;
double end_time;
int y, x, i, port1, port2, port3;
int check =0; // prime number checker, if a number is prime it always remains 0 through out calculation. for a number which is not prime it is turns to value 1 at some point
int signal =0; // has no important use. just to check if slave process work is done.
MPI_Status status;
MPI_Request request;
int main(int argc, char *argv[]){
MPI_Init(&argc, &argv); //initialize MPI operations
MPI_Comm_rank(MPI_COMM_WORLD, &rank); //get the rank
MPI_Comm_size(MPI_COMM_WORLD, &size); //get number of processes
if(rank == 0){ // master process divides work and also does initial work itself
start_time = MPI_Wtime();
printf("2\n"); //print prime number 2 first because the algorithm for finding the prime number in this program is just for odd number
port1 = (N/(size-1)); // calculating the suitable amount of work per process
for(i=1;i<size-1;i++){ // master sending the portion of work to each slave
port2 = port1 * i; // lower bound of work for i th process
port3 = ((i+1)*port1)-1; // upper bound of work for i th process
MPI_Isend(&port2, 1, MPI_INT, i, 100, MPI_COMM_WORLD, &request);
MPI_Isend(&port3, 1, MPI_INT, i, 101, MPI_COMM_WORLD, &request);
}
port2 = (size-1)*port1; port3= N; // the last process takes the remaining work
MPI_Isend(&port2, 1, MPI_INT, (size-1), 100, MPI_COMM_WORLD, &request);
MPI_Isend(&port3, 1, MPI_INT, (size-1), 101, MPI_COMM_WORLD, &request);
for(x = 3; x < port1; x=x+2){ // master doing initial work by itself
check = 0;
for(y = 3; y <= x/2; y=y+2){
if(x%y == 0) {check =1; break;}
}
if(check==0) printf("%d\n", x);
}
}
if (rank > 0){ // slave working part
MPI_Recv(&port2,1,MPI_INT, 0, 100, MPI_COMM_WORLD, &status);
MPI_Recv(&port3,1,MPI_INT, 0, 101, MPI_COMM_WORLD, &status);
if (port2%2 == 0) port2++; // changing the even argument to odd to make the calculation fast because even number is never a prime except 2.
for(x=port2; x<=port3; x=x+2){
check = 0;
for(y = 3; y <= x/2; y=y+2){
if(x%y == 0) {check =1; break;}
}
if (check==0) printf("%d\n",x);
}
signal= rank;
MPI_Isend(&signal, 1, MPI_INT, 0, 103, MPI_COMM_WORLD, &request); // just informing master that the work is finished
}
if (rank == 0){ // master concluding the work and printing the time taken to do the work
for(i== 1; i < size; i++){
MPI_Recv(&signal,1,MPI_INT, i, 103, MPI_COMM_WORLD, &status); // master confirming that all slaves finished their work
}
end_time = MPI_Wtime();
printf("\nRunning Time = %f \n\n", end_time - start_time);
}
MPI_Finalize();
return 0;
}
I got following error
mpirun -np 2 ./a.exe
Exception: STATUS_ACCESS_VIOLATION at eip=0051401C
End of stack trace

I found what was wrong with my program.
It was the use of the restricted variable signal. change the name of that variable (in all places it is used) to any other viable name and it works.

Will MPI_Send block if matching MPI_IRecv takes less data elements?

Assume the following MPI Code.
MPI_Comm_Rank(MPI_COMM_WORLD, &rank);
if (rank == 0){
MPI_Send(a, count, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
MPI_Send(b, count, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
}
else if (rank == 1){
MPI_IRecv(a, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req);
MPI_Recv(b, count, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
MPI_Wait(&req, &status);
}
Is it correct to say the the first MPI_Send(a, count, ...) will not block even though its matching MPI_IRecv(a, 1, ...) is only reading one element from the buffer?
Also, since no reads/writes are done to buffer a, is it correct Process 1 will not block even though MPI_Wait is not called directly after MPI_IRecv?
Thanks.

MPI_Send will block...it is a blocking call. The "blocking" call will return when the send/recv buffer can be safely read/modified by the calling application. No guarantees are made about the matching MPI_[I]recv call.
The MPI library does not know anything about the read/write status of the buffers in the application. The MPI standard calls for certain guarantees to be made by the application about the stability of the message buffers.

A warning when debugging a parallel processing in MPI?

I have codes below :
#include <stdio.h>
#include "mpi.h"
#define NRA 512 /* number of rows in matrix A */
#define NCA 512 /* number of columns in matrix A */
#define NCB 512 /* number of columns in matrix B */
#define MASTER 0 /* taskid of first task */
#define FROM_MASTER 1 /* setting a message type */
#define FROM_WORKER 2 /* setting a message type */
MPI_Status status;
double a[NRA][NCA], /* matrix A to be multiplied */
b[NCA][NCB], /* matrix B to be multiplied */
c[NRA][NCB]; /* result matrix C */
main(int argc, char **argv)
{
int numtasks, /* number of tasks in partition */
taskid, /* a task identifier */
numworkers, /* number of worker tasks */
source, /* task id of message source */
dest, /* task id of message destination */
nbytes, /* number of bytes in message */
mtype, /* message type */
intsize, /* size of an integer in bytes */
dbsize, /* size of a double float in bytes */
rows, /* rows of matrix A sent to each worker */
averow, extra, offset, /* used to determine rows sent to each worker */
i, j, k, /* misc */
count;
double t1,t2;
intsize = sizeof(int);
dbsize = sizeof(double);
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
numworkers = numtasks-1;
//printf(" size of matrix A = %d by %d\n",NRA,NCA);
//printf(" size of matrix B = %d by %d\n",NRA,NCB);
/*---------------------------- master ----------------------------*/
if (taskid == MASTER) {
printf("Number of worker tasks = %d\n",numworkers);
for (i=0; i<NRA; i++)
for (j=0; j<NCA; j++)
a[i][j]= i+j;
for (i=0; i<NCA; i++)
for (j=0; j<NCB; j++)
b[i][j]= i*j;
t1 = MPI_Wtime();
/* send matrix data to the worker tasks */
averow = NRA/numworkers;
extra = NRA%numworkers;
offset = 0;
mtype = FROM_MASTER;
for (dest=1; dest<=numworkers; dest++) {
rows = (dest <= extra) ? averow+1 : averow;
//printf(" Sending %d rows to task %d\n",rows,dest);
MPI_Send(&offset, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
MPI_Send(&rows, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
count = rows*NCA;
MPI_Send(&a[offset][0], count, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
count = NCA*NCB;
MPI_Send(&b, count, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
offset = offset + rows;
}
/* wait for results from all worker tasks */
mtype = FROM_WORKER;
for (i=1; i<=numworkers; i++) {
source = i;
MPI_Recv(&offset, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
MPI_Recv(&rows, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
count = rows*NCB;
MPI_Recv(&c[offset][0], count, MPI_DOUBLE, source, mtype, MPI_COMM_WORLD,
&status);
}
#ifdef PRINT
printf("Here is the result matrix\n");
for (i=0; i<NRA; i++) {
printf("\n");
for (j=0; j<NCB; j++)
printf("%6.2f ", c[i][j]);
}
printf ("\n");
#endif
t2 = MPI_Wtime();
fprintf(stdout,"Time = %.6f\n\n",
t2-t1);
} /* end of master section */
/*---------------------------- worker (slave)----------------------------*/
if (taskid > MASTER) {
mtype = FROM_MASTER;
source = MASTER;
#ifdef PRINT
printf ("Master =%d, mtype=%d\n", source, mtype);
#endif
MPI_Recv(&offset, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
#ifdef PRINT
printf ("offset =%d\n", offset);
#endif
MPI_Recv(&rows, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
#ifdef PRINT
printf ("row =%d\n", rows);
#endif
count = rows*NCA;
MPI_Recv(&a, count, MPI_DOUBLE, source, mtype, MPI_COMM_WORLD, &status);
#ifdef PRINT
printf ("a[0][0] =%e\n", a[0][0]);
#endif
count = NCA*NCB;
MPI_Recv(&b, count, MPI_DOUBLE, source, mtype, MPI_COMM_WORLD, &status);
#ifdef PRINT
printf ("b=\n");
#endif
for (k=0; k<NCB; k++)
for (i=0; i<rows; i++) {
c[i][k] = 0.0;
for (j=0; j<NCA; j++)
c[i][k] = c[i][k] + a[i][j] * b[j][k];
}
//mtype = FROM_WORKER;
#ifdef PRINT
printf ("after computer\n");
#endif
//MPI_Send(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
MPI_Send(&offset, 1, MPI_INT, MASTER, FROM_WORKER, MPI_COMM_WORLD);
//MPI_Send(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
MPI_Send(&rows, 1, MPI_INT, MASTER, FROM_WORKER, MPI_COMM_WORLD);
//MPI_Send(&c, rows*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD);
MPI_Send(&c, rows*NCB, MPI_DOUBLE, MASTER, FROM_WORKER, MPI_COMM_WORLD);
#ifdef PRINT
printf ("after send\n");
#endif
} /* end of worker */
MPI_Finalize();
} /* end of main */
The codes are matrix multiplication using MPI. When i try to debug it using visual studio 2010 express : it's display a warning
I want to ask, where was the problem during debugging the code? Does anyone can help me?

The error is at this line:
averow = NRA/numworkers;
numworkers is 0, presumably because you haven't configured Visual Studio to launch the MPI job with more processes. It's pretty fiddly to get it to do the right thing here, especially when debugging.
Make sure the MPI Cluster Debugger is installed correctly - this is the most likely culprit.

Send array of mpz_t over mpi

I use a libgmp (GMP) to work with very long integers, stored as mpz_t: http://gmplib.org/manual/Integer-Internals.html#Integer-Internals
mpz_t variables represent integers using sign and magnitude, in space dynamically allocated and reallocated.
So I think mpz_t is like pointer.
How can I send an array of mpz_t variables with data over MPI?

Use mpz_import() and mpz_export() to convert between mpz_t and e.g. char arrays, which you can then send/receive over MPI. Be careful with getting the parameters related to endianness etc. right.

Here is the code:
unsigned long *buf, *t; // pointers for ulong array for storing gmp data
unsigned long count, countc; // sizes of data and element sizes array
unsigned long size = array_size; // size of array
size_t *bc,*tc; // pointers for size_t array to store element sizes;
buf=(unsigned long*)malloc(sizeof(unsigned long)*(size*limb_per_element));
bc=(size_t*)malloc(sizeof(size_t)*(size));
if(rank==SENDER_RANK) {
t=buf;
tc=bc;
for(int i;i<size;i++) {
mpz_export(t,tc,1,sizeof(unsigned long),0,0, ARRAY(i));
t+=*tc;
tc++;
}
count=t-buf;
countc=tc-bc;
MPI_Send(&count, 1, MPI_UNSIGNED_LONG, 0, 0, MPI_COMM_WORLD);
MPI_Send(&countc, 1, MPI_UNSIGNED_LONG, 0, 0, MPI_COMM_WORLD);
MPI_Send(bc, countc*(sizeof(size_t)), MPI_CHAR, 0, 0, MPI_COMM_WORLD);
MPI_Send(buf, count, MPI_UNSIGNED_LONG, 0, 0, MPI_COMM_WORLD);
} else {
status=MPI_Recv(&count, 1, MPI_UNSIGNED_LONG, SENDER_RANK, 0, MPI_COMM_WORLD, NULL);
status=MPI_Recv(&countc, 1, MPI_UNSIGNED_LONG, SENDER_RANK, 0, MPI_COMM_WORLD, NULL);
t=buf;
tc=bc;
status=MPI_Recv(bc, countc*(sizeof(size_t)), MPI_CHAR, SENDER_RANK, 0, MPI_COMM_WORLD, NULL);
status=MPI_Recv(buf, count, MPI_UNSIGNED_LONG, SENDER_RANK, 0, MPI_COMM_WORLD, NULL);
for(int i; i<size; i++) {
mpz_import(ARRAY(i),*tc,1,sizeof(unsigned long),0,0, t);
t+=*tc;
tc++;
}
}
free(buf);
free(bc);

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

MPI communication delay - does the size of the message matter? - time

Related

The relation between MPI_Comm_split and MPI_Send / MPI_Recv

mpi parallel program to find prime numbers. Please help me dubug

Will MPI_Send block if matching MPI_IRecv takes less data elements?

A warning when debugging a parallel processing in MPI?

Send array of mpz_t over mpi

Categories

Resources