Assume the following MPI Code.
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0){
MPI_Send(a, count, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
MPI_Send(b, count, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
}
else if (rank == 1){
MPI_Irecv(a, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req);
MPI_Recv(b, count, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
MPI_Wait(&req, &status);
}
Is it correct to say that the first MPI_Send(a, count, ...) will not block even though its matching MPI_Irecv(a, 1, ...) is only reading one element from the buffer?
Also, since no reads/writes are done to buffer a, is it correct that process 1 will not block even though MPI_Wait is not called directly after MPI_Irecv?
Thanks.
MPI_Send will block: it is a blocking call. A "blocking" call returns when the send/recv buffer can be safely read or modified by the calling application; no guarantees are made about the matching MPI_[I]recv call.
The MPI library does not know anything about the read/write status of the buffers in the application. The MPI standard requires the application to make certain guarantees about the stability of the message buffers.
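For example, here is a minimal sketch of the safe pattern on rank 1, reusing the question's variables (with the receive count set to count so the message sizes match): a must not be read or written until MPI_Wait returns.
MPI_Irecv(a, count, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req);
MPI_Recv(b, count, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
/* work that touches b but not a is fine here */
MPI_Wait(&req, &status); /* only after this may a be safely read or modified */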
I have two matrices that I want to multiply. I want to use MPI with shared memory, but I do not understand how to do this.
I do not understand what I should do in the preparatory stage. If the rank of my process is 0, then I need to initialize the window, but what if the rank is not 0?
If a window is a shared memory location for all processes, then how can I synchronize its use?
Also, I don't understand how to parallelize the multiplication. I know the number of processes, but how should the loop bounds be changed?
I've put together a few examples piece by piece, but I doubt my code is correct. Please give me a hint. Thank you!
real :: a(5,6), b(6,4), result(5,4)
integer :: ierror, size_Of_Cluster, process_Rank, window, i, j, k
integer(kind=MPI_ADDRESS_KIND) :: win_size
call MPI_INIT(ierror)
call MPI_COMM_SIZE(MPI_COMM_WORLD, size_Of_Cluster, ierror) ! get count of processes
call MPI_COMM_RANK(MPI_COMM_WORLD, process_Rank, ierror) ! get rank of current process
! expose result as an RMA window on every process (size in bytes, 4-byte reals)
win_size = 4 * size(result)
call MPI_WIN_CREATE(result, win_size, 4, MPI_INFO_NULL, MPI_COMM_WORLD, window, ierror)
do i = 1, 5
do j = 1, 4
result(i,j) = 0
do k = 1, 6
result(i,j) = result(i,j) + a(i,k) * b(k,j)
end do
end do
end do
call MPI_WIN_FENCE(0, window, ierror)
if (process_Rank > 0) then
! push my values into the result window of MPI process 0
call MPI_ACCUMULATE(result, size(result), MPI_REAL, 0, 0_MPI_ADDRESS_KIND, size(result), MPI_REAL, MPI_SUM, window, ierror)
end if
call MPI_WIN_FENCE(0, window, ierror)
if (process_Rank == 0) then
print *, '[MPI process 0] result'
end if
call MPI_WIN_FREE(window, ierror)
call MPI_FINALIZE(ierror)
Let's say there are 3 processes. This code works fine:
#include <iostream>
#include <mpi.h>
using namespace std;
int main(){
MPI_Init(NULL, NULL);
int rank; MPI_Comm SubWorld; int buf;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0){
MPI_Comm_split(MPI_COMM_WORLD, rank, rank, &SubWorld);
MPI_Send(&buf, 1, MPI_INT, 1, 55, MPI_COMM_WORLD);
}
else if (rank == 1){
MPI_Comm_split(MPI_COMM_WORLD, rank, rank, &SubWorld);
MPI_Recv(&buf, 1, MPI_INT, 0, 55, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
else MPI_Comm_split(MPI_COMM_WORLD, rank, rank, &SubWorld);
cout << "Done" << endl;
MPI_Finalize();
return 0;
}
Outputs "Done" three times as expected.
But this code has a problem (also 3 processes):
#include <iostream>
#include <mpi.h>
using namespace std;
int main(){
MPI_Init(NULL, NULL);
int rank; MPI_Comm SubWorld; int buf;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0){
MPI_Comm_split(MPI_COMM_WORLD, rank, rank, &SubWorld);
MPI_Send(&buf, 1, MPI_INT, 1, 55, MPI_COMM_WORLD);
}
else if (rank == 1){
MPI_Recv(&buf, 1, MPI_INT, 0, 55, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Comm_split(MPI_COMM_WORLD, rank, rank, &SubWorld);
}
else MPI_Comm_split(MPI_COMM_WORLD, rank, rank, &SubWorld);
cout << "Done" << endl;
MPI_Finalize();
return 0;
}
There is no output!!
What exactly is the relation between MPI_Comm_split and MPI_Send / MPI_Recv which causes this problem?
MPI_Comm_split() is a collective operation, which means it must be invoked by all MPI tasks of the initial communicator (MPI_COMM_WORLD here).
In your example, rank 1 blocks in MPI_Recv(), waiting for a message from rank 0; rank 0, however, is stuck in MPI_Comm_split(), which cannot complete until rank 1 also calls it, so MPI_Send() is never invoked and the program deadlocks.
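One minimal fix, staying within the question's code, is to have rank 1 enter the collective first (exactly as in the first, working version), so rank 0 can complete MPI_Comm_split() and reach its MPI_Send():
else if (rank == 1){
MPI_Comm_split(MPI_COMM_WORLD, rank, rank, &SubWorld); // every rank now reaches the collective
MPI_Recv(&buf, 1, MPI_INT, 0, 55, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}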
You might consider padb to visualize the state of an MPI program; it makes it easy to see where the stacks are stuck.
I wrote the following program to find the prime numbers up to the #defined value N. It is a parallel program using MPI. Can anyone help me find the error in it? It compiles fine but crashes while executing.
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define N 65
int rank, size;
double start_time;
double end_time;
int y, x, i, port1, port2, port3;
int check =0; // prime flag: stays 0 throughout the calculation for a prime; becomes 1 at some point for a composite number
int signal =0; // no important use, just to tell the master that a slave's work is done
MPI_Status status;
MPI_Request request;
int main(int argc, char *argv[]){
MPI_Init(&argc, &argv); //initialize MPI operations
MPI_Comm_rank(MPI_COMM_WORLD, &rank); //get the rank
MPI_Comm_size(MPI_COMM_WORLD, &size); //get number of processes
if(rank == 0){ // master process divides work and also does initial work itself
start_time = MPI_Wtime();
printf("2\n"); //print prime number 2 first because the algorithm for finding the prime number in this program is just for odd number
port1 = (N/(size-1)); // calculating the suitable amount of work per process
for(i=1;i<size-1;i++){ // master sending the portion of work to each slave
port2 = port1 * i; // lower bound of work for i th process
port3 = ((i+1)*port1)-1; // upper bound of work for i th process
MPI_Isend(&port2, 1, MPI_INT, i, 100, MPI_COMM_WORLD, &request);
MPI_Isend(&port3, 1, MPI_INT, i, 101, MPI_COMM_WORLD, &request);
}
port2 = (size-1)*port1; port3= N; // the last process takes the remaining work
MPI_Isend(&port2, 1, MPI_INT, (size-1), 100, MPI_COMM_WORLD, &request);
MPI_Isend(&port3, 1, MPI_INT, (size-1), 101, MPI_COMM_WORLD, &request);
for(x = 3; x < port1; x=x+2){ // master doing initial work by itself
check = 0;
for(y = 3; y <= x/2; y=y+2){
if(x%y == 0) {check =1; break;}
}
if(check==0) printf("%d\n", x);
}
}
if (rank > 0){ // slave working part
MPI_Recv(&port2,1,MPI_INT, 0, 100, MPI_COMM_WORLD, &status);
MPI_Recv(&port3,1,MPI_INT, 0, 101, MPI_COMM_WORLD, &status);
if (port2%2 == 0) port2++; // make an even lower bound odd; no even number except 2 is prime
for(x=port2; x<=port3; x=x+2){
check = 0;
for(y = 3; y <= x/2; y=y+2){
if(x%y == 0) {check =1; break;}
}
if (check==0) printf("%d\n",x);
}
signal= rank;
MPI_Isend(&signal, 1, MPI_INT, 0, 103, MPI_COMM_WORLD, &request); // just informing master that the work is finished
}
if (rank == 0){ // master concluding the work and printing the time taken to do the work
for(i== 1; i < size; i++){
MPI_Recv(&signal,1,MPI_INT, i, 103, MPI_COMM_WORLD, &status); // master confirming that all slaves finished their work
}
end_time = MPI_Wtime();
printf("\nRunning Time = %f \n\n", end_time - start_time);
}
MPI_Finalize();
return 0;
}
I got the following error:
mpirun -np 2 ./a.exe
Exception: STATUS_ACCESS_VIOLATION at eip=0051401C
End of stack trace
I found what was wrong with my program.
It was the use of the variable name signal, which clashes with the C library's signal() function. Change the name of that variable (in all places it is used) to any other valid name and it works.
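For example, with done_flag as an illustrative replacement name, the affected lines become:
int done_flag =0; // was "signal"; that name clashes with the C library's signal() function
...
done_flag = rank;
MPI_Isend(&done_flag, 1, MPI_INT, 0, 103, MPI_COMM_WORLD, &request); // just informing master that the work is finished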
I've been working with time measurement (benchmarking) of parallel algorithms, more specifically, matrix multiplication. I'm using the following algorithm:
if(taskid==MASTER) {
averow = NRA/numworkers;
extra = NRA%numworkers;
offset = 0;
mtype = FROM_MASTER;
for (dest=1; dest<=numworkers; dest++)
{
rows = (dest <= extra) ? averow+1 : averow;
MPI_Send(&offset, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
MPI_Send(&rows, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
MPI_Send(&a[offset][0], rows*NCA, MPI_DOUBLE, dest, mtype,MPI_COMM_WORLD);
MPI_Send(&b, NCA*NCB, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
offset = offset + rows;
}
mtype = FROM_WORKER;
for (i=1; i<=numworkers; i++)
{
source = i;
MPI_Recv(&offset, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
MPI_Recv(&rows, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
MPI_Recv(&c[offset][0], rows*NCB, MPI_DOUBLE, source, mtype,
MPI_COMM_WORLD, &status);
printf("Resultados recebidos do processo %d\n",source);
}
}
else {
mtype = FROM_MASTER;
MPI_Recv(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
MPI_Recv(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
MPI_Recv(&a, rows*NCA, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);
MPI_Recv(&b, NCA*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);
for (k=0; k<NCB; k++)
for (i=0; i<rows; i++)
{
c[i][k] = 0.0;
for (j=0; j<NCA; j++)
c[i][k] = c[i][k] + a[i][j] * b[j][k];
}
mtype = FROM_WORKER;
MPI_Send(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
MPI_Send(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
MPI_Send(&c, rows*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD);
}
I noticed that for square matrices it took less time than for rectangular ones.
For example: if I use 4 nodes (one as master) and A is 500x500 and B is 500x500, the number of iterations per node equals 41.5 million, while if A is 2400000x6 and B is 6x6, it iterates 28.8 million times per node. Although the second case does fewer iterations, it took about 1.00 second, while the first took only about 0.46 s.
Logically, the second should be faster, considering it does fewer iterations per node.
Doing some math, I realized that MPI sends and receives about 83,000 elements per message in the first case (about 167 rows x 500 columns of A per worker) and 4,800,000 elements in the second (800,000 rows x 6 columns).
Does the size of the message justify the delay?
The size of the messages sent over MPI will definitely affect the performance of your code. Take a look at THESE graphs posted on the webpage of one of the popular MPI implementations.
As you can see in the first graph, the latency of communication increases with message size. This trend applies to any network, not just the InfiniBand shown in that graph.
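If you want to measure this on your own system, a minimal ping-pong sketch along the following lines (the message sizes and the assumption of at least two ranks are illustrative) shows the round-trip time growing with message size:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
int main(int argc, char *argv[]){
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int n = 1000; n <= 10000000; n *= 10) { // message size in doubles
        double *buf = malloc(n * sizeof(double)); // contents don't matter for timing
        MPI_Barrier(MPI_COMM_WORLD); // start all ranks together
        double t0 = MPI_Wtime();
        if (rank == 0) {
            MPI_Send(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("%d doubles: round trip took %f s\n", n, MPI_Wtime() - t0);
        } else if (rank == 1) {
            MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
        free(buf);
    }
    MPI_Finalize();
    return 0;
}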
I use libgmp (GMP) to work with very long integers, stored as mpz_t: http://gmplib.org/manual/Integer-Internals.html#Integer-Internals
mpz_t variables represent integers using sign and magnitude, in space dynamically allocated and reallocated.
So I think mpz_t is like a pointer.
How can I send an array of mpz_t variables with data over MPI?
Use mpz_import() and mpz_export() to convert between mpz_t and e.g. char arrays, which you can then send/receive over MPI. Be careful with getting the parameters related to endianness etc. right.
Here is the code:
unsigned long *buf, *t; // pointers into the unsigned long array that stores the exported GMP data
unsigned long count, countc; // number of words in the data array / number of entries in the sizes array
unsigned long size = array_size; // number of mpz_t elements in the array
size_t *bc,*tc; // pointers into the size_t array that stores per-element word counts
buf=(unsigned long*)malloc(sizeof(unsigned long)*(size*limb_per_element));
bc=(size_t*)malloc(sizeof(size_t)*(size));
if(rank==SENDER_RANK) {
t=buf;
tc=bc;
for(int i;i<size;i++) {
mpz_export(t,tc,1,sizeof(unsigned long),0,0, ARRAY(i));
t+=*tc;
tc++;
}
count=t-buf;
countc=tc-bc;
MPI_Send(&count, 1, MPI_UNSIGNED_LONG, 0, 0, MPI_COMM_WORLD);
MPI_Send(&countc, 1, MPI_UNSIGNED_LONG, 0, 0, MPI_COMM_WORLD);
MPI_Send(bc, countc*(sizeof(size_t)), MPI_CHAR, 0, 0, MPI_COMM_WORLD);
MPI_Send(buf, count, MPI_UNSIGNED_LONG, 0, 0, MPI_COMM_WORLD);
} else {
status=MPI_Recv(&count, 1, MPI_UNSIGNED_LONG, SENDER_RANK, 0, MPI_COMM_WORLD, NULL);
status=MPI_Recv(&countc, 1, MPI_UNSIGNED_LONG, SENDER_RANK, 0, MPI_COMM_WORLD, NULL);
t=buf;
tc=bc;
status=MPI_Recv(bc, countc*(sizeof(size_t)), MPI_CHAR, SENDER_RANK, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
status=MPI_Recv(buf, count, MPI_UNSIGNED_LONG, SENDER_RANK, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
for(int i; i<size; i++) {
mpz_import(ARRAY(i),*tc,1,sizeof(unsigned long),0,0, t);
t+=*tc;
tc++;
}
}
free(buf);
free(bc);
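For reference on the parameter choices above: mpz_export(t, tc, 1, sizeof(unsigned long), 0, 0, ARRAY(i)) writes the magnitude of ARRAY(i) into t as words of sizeof(unsigned long) bytes, most significant word first (order 1), native endianness (endian 0), no nail bits (nails 0), and stores the number of words written in *tc; the mpz_import() call on the receiving side must use the same order, size, endianness and nail values. Note that this transfers only the magnitude: the sign of each mpz_t is not preserved.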