How can I use MPI with shared memory? - parallel-processing

I have two matrices that I want to multiply. I want to use MPI with shared memory, but I do not understand how to do this.
I do not understand what I should do in the preparatory stage. If the rank of my process is 0, then I need to initialize the window, but what if the rank is not 0?
If a window is a shared memory location for all processes, then how can I synchronize its use?
Also, I don't understand how to parallelize the multiplication. I know the number of processes, but how should the loop bounds be changed?
I've put together a few examples piece by piece, but I doubt my code is correct. Please give me a hint. Thank you!
! fragment - assumes "use mpi" (or include 'mpif.h') in the enclosing program
real :: a(5,6), b(6,4), result(5,4)
integer :: ierror, size_Of_Cluster, process_Rank, window, i, j, k
integer(kind=MPI_ADDRESS_KIND) :: win_size, target_disp

call MPI_INIT(ierror)
call MPI_COMM_SIZE(MPI_COMM_WORLD, size_Of_Cluster, ierror) ! get number of processes
call MPI_COMM_RANK(MPI_COMM_WORLD, process_Rank, ierror) ! get rank of current process

! expose the local copy of result as an RMA window (4 bytes per real)
win_size = size(result) * 4
call MPI_WIN_CREATE(result, win_size, 4, MPI_INFO_NULL, MPI_COMM_WORLD, window, ierror)

! plain (not yet distributed) matrix product: result = a * b
do i = 1, 5
do j = 1, 4
result(i,j) = 0.0
do k = 1, 6
result(i,j) = result(i,j) + a(i,k) * b(k,j)
end do
end do
end do

call MPI_WIN_FENCE(0, window, ierror)
if (process_Rank > 0) then
! add my values into the window of MPI process 0
target_disp = 0
call MPI_ACCUMULATE(result, size(result), MPI_REAL, 0, target_disp, size(result), MPI_REAL, MPI_SUM, window, ierror)
end if
call MPI_WIN_FENCE(0, window, ierror)
if (process_Rank == 0) then
print *, '[MPI process 0] result = ', result
end if
call MPI_WIN_FREE(window, ierror)
call MPI_FINALIZE(ierror)
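For reference, here is a minimal C sketch of the usual MPI shared-memory pattern: every rank calls the window routines collectively (not just rank 0), the window is allocated with MPI_Win_allocate_shared on a per-node communicator, fences provide the synchronization, and the row loop is split by rank. The matrix sizes and the placeholder data are illustrative only, not taken from your code.

#include <mpi.h>
#include <stdio.h>

#define N 5   /* rows of a and result  */
#define K 6   /* cols of a, rows of b  */
#define M 4   /* cols of b and result  */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* communicator containing only the ranks that share a node */
    MPI_Comm node;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node);

    int rank, nprocs;
    MPI_Comm_rank(node, &rank);
    MPI_Comm_size(node, &nprocs);

    /* collective call on every rank: rank 0 contributes the storage,
       the other ranks contribute 0 bytes and share it */
    MPI_Win win;
    double *result;
    MPI_Aint bytes = (rank == 0) ? (MPI_Aint)(N * M * sizeof(double)) : 0;
    MPI_Win_allocate_shared(bytes, sizeof(double), MPI_INFO_NULL,
                            node, &result, &win);

    /* non-zero ranks look up the address of rank 0's segment */
    if (rank != 0) {
        MPI_Aint sz; int disp;
        MPI_Win_shared_query(win, 0, &sz, &disp, &result);
    }

    /* placeholder input data, identical on every rank */
    double a[N][K], b[K][M];
    for (int i = 0; i < N; i++)
        for (int k = 0; k < K; k++) a[i][k] = 1.0;
    for (int k = 0; k < K; k++)
        for (int j = 0; j < M; j++) b[k][j] = 1.0;

    MPI_Win_fence(0, win);
    /* split the rows of result among the ranks */
    for (int i = rank; i < N; i += nprocs)
        for (int j = 0; j < M; j++) {
            double s = 0.0;
            for (int k = 0; k < K; k++) s += a[i][k] * b[k][j];
            result[i * M + j] = s;
        }
    MPI_Win_fence(0, win);   /* all rows are now visible to every rank */

    if (rank == 0)
        printf("result[0][0] = %f\n", result[0]);

    MPI_Win_free(&win);
    MPI_Comm_free(&node);
    MPI_Finalize();
    return 0;
}

This only works when all ranks run on the same node; across nodes, an MPI_Win_create plus MPI_Accumulate scheme like the one you sketched (or a plain MPI_Reduce) is the portable route.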

Related

How to Parallelize a nested for loop using CUDA to perform a computation on a 2D Array

I am working on some research and am very much a beginner at using CUDA. The languages I'm using are C and C++, the basic languages compatible with Nvidia's CUDA. Over the past week I've been stuck on trying to get any sort of speedups through integrating CUDA with my C++ code.
As far as I know, I am handling the basics of memory allocation and deallocation correctly. But when it comes to actually speeding up the calculations, I am currently getting different results from the non-CUDA implementation.
In addition, the CUDA implementation is also SLOWER than the normal non-CUDA version.
The following is the function I am calling the kernel function from. Essentially I moved the computation that was originally in this function into the kernel function in order to parallelize it.
//compute the distance between inputs
void computeInput(int vectorNumber, double *dist, double **weight){
double *d_dist, **d_weight;
//cout << "Dist[0] Before: " << dist[0] << endl;
cudaMalloc(&d_dist, maxClusters * sizeof(double));
cudaMalloc(&d_weight, maxClusters * vector_length * sizeof(double));
// cout << "Memory Allocated" << endl;
//copy variables from host machine running on CPU to Kernel running on GPU
cudaMemcpy(d_dist, dist, maxClusters * sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(d_weight, weight, maxClusters * vector_length * sizeof(double), cudaMemcpyHostToDevice);
// cout << "Variables copied to GPU Device." << endl;
//kernel currently launched with a 1x1 grid of 8x8-thread blocks (64 threads in one block).
//right now only a single loop is parallelized; I need to parallelize each loop individually, or the 2D arrays individually.
dim3 blocks(8,8);
dim3 grid(1, 1);
threadedInput<<<grid,blocks>>>(vectorNumber, d_dist, d_weight);
// cout << "Kernel Run." << endl;
//Waits for the GPU to finish computations
cudaDeviceSynchronize();
//cout << "Weight[0][0] : " << weight[0][0];
//copy back variable from kernel space on GPU to host on CPU into variable weight
cudaMemcpy(weight, d_weight, maxClusters * vector_length * sizeof(double), cudaMemcpyDeviceToHost);
cudaMemcpy(dist, d_dist, maxClusters * sizeof(double), cudaMemcpyDeviceToHost);
// cout << "GPU Memory Copied back to Host" << endl;
cout << "Dist[0] After: " << dist[0] << endl;
cudaFree(d_dist);
cudaFree(d_weight);
//cout << " Cuda Memory Freed" << endl;
}
The following is the Kernel Function. It is calculating the distance using weights on nodes.
What I WANT it to do is to perform each iteration of the loops on separate threads.
What I fear it is doing is messing up the order and performing the wrong calculations. I have already searched through Stack Overflow and other places for help on nested for loop parallelization, yet none of them shed much light on the matter as to what I am doing wrong. Any suggestions?
__global__ void threadedInput(int vecNum, double *dist, double **weight)
{
int tests[vectors][vector_length] = {{0, 1, 1, 0},
{1, 0, 0, 1},
{0, 1, 0, 1},
{1, 0, 1, 0}};
dist[0] = 0.0;
dist[1] = 0.0;
int indexX,indexY, incrX, incrY;
indexX = blockIdx.x * blockDim.x + threadIdx.x;
indexY = blockIdx.y * blockDim.y + threadIdx.y;
incrX = blockDim.x * gridDim.x;
incrY = blockDim.y * gridDim.y;
for(int i = indexY; i <= (maxClusters - 1); i+=incrY)
{
for(int j = indexX; j <= (vectors - 1); j+= incrX)
{
dist[i] += pow((weight[i][j] - tests[vecNum][j]), 2);
}// end inner for
}// end outer for
}// end CUDA-kernel
My Current Output:
Clusters for training input:
Vector (1, 0, 1, 0, ) Place in Bin 0
Vector (1, 1, 1, 0, ) Place in Bin 0
Vector (0, 1, 1, 1, ) Place in Bin 0
Vector (1, 1, 0, 0, ) Place in Bin 0
Weights for Node 0 connections:
0.74753098, 0.75753881, 0.74233157, 0.25246902,
Weights for Node 1 connections:
0.00000000, 0.00000000, 0.00000000, 0.00000000,
Categorized test input:
Vector (0, 1, 1, 0, ) Place in Bin 0
Vector (1, 0, 0, 1, ) Place in Bin 0
Vector (0, 1, 0, 1, ) Place in Bin 0
Vector (1, 0, 1, 0, ) Place in Bin 0
Time Ran: 0.96623900
Expected Output (except that the run time should be at least 50% faster than the non-CUDA version)
Clusters for training input:
Vector (1, 0, 1, 0, ) Place in Bin 0
Vector (1, 1, 1, 0, ) Place in Bin 1
Vector (0, 1, 1, 1, ) Place in Bin 0
Vector (1, 1, 0, 0, ) Place in Bin 1
Weights for Node 0 connections:
0.74620975, 0.75889148, 0.74351981, 0.25379025,
Weights for Node 1 connections:
0.75368531, 0.75637331, 0.74105526, 0.24631469,
Categorized test input:
Vector (0, 1, 1, 0, ) Place in Bin 0
Vector (1, 0, 0, 1, ) Place in Bin 1
Vector (0, 1, 0, 1, ) Place in Bin 0
Vector (1, 0, 1, 0, ) Place in Bin 1
Time Ran: 0.00033100
You should read some tutorials; begin with: https://devblogs.nvidia.com/easy-introduction-cuda-c-and-c/
Basically, each thread executes the kernel code, so there should be no loop inside it.
I am quoting:
Device Code
We now move on to the kernel code.
__global__
void saxpy(int n, float a, float *x, float *y)
{
int i = blockIdx.x*blockDim.x + threadIdx.x;
if (i < n) y[i] = a*x[i] + y[i];
}
In CUDA, we define kernels such as saxpy using the __global__ declaration specifier. Variables defined within device code do not need to be specified as device variables because they are assumed to reside on the device. In this case the n, a and i variables will be stored by each thread in a register, and the pointers x and y must be pointers to the device memory address space. This is indeed true because we passed d_x and d_y to the kernel when we launched it from the host code. The first two arguments, n and a, however, were not explicitly transferred to the device in host code. Because function arguments are passed by value by default in C/C++, the CUDA runtime can automatically handle the transfer of these values to the device. This feature of the CUDA Runtime API makes launching kernels on the GPU very natural and easy—it is almost the same as calling a C function.
There are only two lines in our saxpy kernel. As mentioned earlier, the kernel is executed by multiple threads in parallel. If we want each thread to process an element of the resultant array, then we need a means of distinguishing and identifying each thread. CUDA defines the variables blockDim, blockIdx, and threadIdx. These predefined variables are of type dim3, analogous to the execution configuration parameters in host code. The predefined variable blockDim contains the dimensions of each thread block as specified in the second execution configuration parameter for the kernel launch. The predefined variables threadIdx and blockIdx contain the index of the thread within its thread block and the thread block within the grid, respectively. The expression:
int i = blockDim.x * blockIdx.x + threadIdx.x
generates a global index that is used to access elements of the arrays. We didn't use it in this example, but there is also gridDim which contains the dimensions of the grid as specified in the first execution configuration parameter to the launch.
Before this index is used to access array elements, its value is checked against the number of elements, n, to ensure there are no out-of-bounds memory accesses. This check is required for cases where the number of elements in an array is not evenly divisible by the thread block size, and as a result the number of threads launched by the kernel is larger than the array size. The second line of the kernel performs the element-wise work of the SAXPY, and other than the bounds check, it is identical to the inner loop of a host implementation of SAXPY.
if (i < n) y[i] = a*x[i] + y[i];

MPI_Bcast synchronization between receiver and sender

Imagine having n processes, each holding a matrix of 2 rows and 8 elements (stored linearly, not in 2D). I want each process to communicate its rows to all processes with lower ranks. For instance, the process with rank 2 communicates its rows to the processes with ranks 1 and 0; the process with rank 0 does not communicate its rows to any process.
I'm having issues deciding how to approach this problem. Using MPI_Bcast is a possible solution, but I can't seem to get the operation to work as expected. Below you can see a sample of the code I'm executing.
// npes is the number of processes obtained from MPI_INIT
// The value for i below is used to specify the number of
// rows that will be received
for (i = (npes - rank - 1) * rowsPerProcess; i > 0; i--) {
// Receive
MPI_Bcast(temp, columns, MPI_DOUBLE, i/rowsPerProcess, MPI_COMM_WORLD);
printf("I'm %d and I received from %d\n", rank, i/rowsPerProcess);
}
if (rank != 0) { // rank 0 does not send data
for (row = rowsPerProcess - 1; row >= 0; row--) {
for (j = 0; j < columns; j++) {
//matrix_chunk is the per process matrix of 2 rows
temp[j] = matrix_chunk[row*columns + j];
}
// Send
printf("I'm sender %d\n", rank);
MPI_Bcast(temp, columns, MPI_DOUBLE, rank, MPI_COMM_WORLD);
}
}
The output I receive is the following:
I'm 1 and I received from 1
I'm sender 2
I'm sender 2
I'm 0 and I received from 2
I'm 0 and I received from 1
I'm 0 and I received from 1
I'm 0 and I received from 0
I'm 1 and I received from 0
I'm sender 1
I'm sender 1
It seems that the first receive MPI_Bcast call is executing as a sender operation. I have also printed the contents of the received temp matrix and they are not what I expect them to be.
Rather than trying to correct this mess, I would like to get a perspective on how to approach this particular communication problem. I feel like I'm approaching it from the wrong direction. Please let me know if you have any suggestions!
I implemented matched MPI_Send and MPI_Recv calls as suggested by High Performance Mark. The problem immediately made sense when I thought about it through this approach.
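For what it's worth, a minimal sketch of that matched send/receive pattern could look like the following; the 2x8 chunk sizes and the placeholder data mirror the question, but each whole chunk is sent in a single message instead of row by row.

#include <mpi.h>
#include <stdio.h>

#define ROWS_PER_PROC 2
#define COLUMNS 8

int main(int argc, char **argv)
{
    int rank, npes;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &npes);

    /* each rank owns a 2 x 8 chunk, stored linearly */
    double matrix_chunk[ROWS_PER_PROC * COLUMNS];
    for (int k = 0; k < ROWS_PER_PROC * COLUMNS; k++)
        matrix_chunk[k] = rank;   /* placeholder data */

    /* send my rows to every process with a lower rank */
    for (int dest = 0; dest < rank; dest++)
        MPI_Send(matrix_chunk, ROWS_PER_PROC * COLUMNS, MPI_DOUBLE,
                 dest, 0, MPI_COMM_WORLD);

    /* receive the rows of every process with a higher rank */
    double recv_chunk[ROWS_PER_PROC * COLUMNS];
    for (int src = rank + 1; src < npes; src++) {
        MPI_Recv(recv_chunk, ROWS_PER_PROC * COLUMNS, MPI_DOUBLE,
                 src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank %d received the chunk of rank %d (first value %g)\n",
               rank, src, recv_chunk[0]);
    }

    MPI_Finalize();
    return 0;
}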

MPI help on how to parallelize my code

I am very much a newbie in this subject and need help on how to parallelize my code.
I have a large 1D array that in reality describes a 3D volume: 21x21x21 single precision values.
I have 3 computers that I want to engage in the computation. The operation performed on each cell of the grid (volume) is identical for all cells. The program takes in some data, performs some simple arithmetic on it, and the return value is assigned to the grid cell.
My non-parallelized code is:
float zg, yg, xg;
stack_result = new float[Nz*Ny*Nx];
// StRMtrx[8] is the vertical step size, StRMtrx[6] is the vertical starting point
for (int iz=0; iz<Nz; iz++) {
zg = iz*StRMtrx[8]+StRMtrx[6]; // find the vertical position in meters
// StRMtrx[5] is the crossline step size, StRMtrx[3] is the crossline starting point
for (int iy=0; iy<Ny; iy++) {
yg = iy*StRMtrx[5]+StRMtrx[3]; // find the crossline position
// StRMtrx[2] is the inline step size, StRMtrx[0] is the inline starting point
for (int ix=0; ix < Nx; ix++) {
xg = ix*StRMtrx[2]+StRMtrx[0]; // find the inline position
// do stacking on each grid cell
// "Geoph" is the geophone ids, "Ngeo" is the number of geophones involved,
// "pahse_use" is the wave type, "EnvMtrx" is the input data common to all
// cells, "Mdata" is the length of input data
stack_result[ix+Nx*iy+Nx*Ny*iz] =
stack_for_qds(Geoph, Ngeo, phase_use, xg, yg, zg, EnvMtrx, Mdata);
}
}
}
Now I take the 3 computers and divide the volume into 3 vertical segments, so I have 3 sub-volumes of 21x21x7 cells each. (Note that the volume is traversed in z, y, x order.)
The variable "stack_result" holds the complete volume.
My parallelized version (which utterly fails; I only get one of the sub-volumes back) is:
MPI_Status status;
int rank, numProcs, rootProcess;
ierr = MPI_Init(&argc, &argv);
ierr = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
ierr = MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
int rowsInZ = Nz/numProcs; // 7 cells in Z (vertical)
int chunkSize = Nx*Ny*rowsInZ;
float *stack_result = new float[Nz*Ny*Nx];
float zg, yg, xg;
rootProcess = 0;
if(rank == rootProcess) {
offset = 0;
for (int n = 1; n < numProcs; n++) {
// send rank
MPI_Send(&n, 1, MPI_INT, n, 2, MPI_COMM_WORLD);
// send the offset in array
MPI_Send(&offset, 1, MPI_INT, n, 2, MPI_COMM_WORLD);
// send volume, now only filled with zeros,
MPI_Send(&stack_result[offset], chunkSize, MPI_FLOAT, n, 1, MPI_COMM_WORLD);
offset = offset+chunkSize;
}
// receive results
for (int n = 1; n < numProcs; n++) {
int source = n;
MPI_Recv(&offset, 1, MPI_INT, source, 2, MPI_COMM_WORLD, &status);
MPI_Recv(&stack_result[offset], chunkSize, MPI_FLOAT, source, 1, MPI_COMM_WORLD, &status);
}
} else {
int rank;
int source = 0;
int ierr = MPI_Recv(&rank, 1, MPI_INT, source, 2, MPI_COMM_WORLD, &status);
ierr = MPI_Recv(&offset, 1, MPI_INT, source, 2, MPI_COMM_WORLD, &status);
ierr = MPI_Recv(&stack_result[offset], chunkSize, MPI_FLOAT, source, 1, MPI_COMM_WORLD, &status);
int nz = rowsInZ; // sub-volume vertical length
int startZ = (rank-1)*rowsInZ;
for (int iz = startZ; iz < startZ+nz; iz++) {
zg = iz*StRMtrx[8]+StRMtrx[6];
for (int iy = 0; iy < Ny; iy++) {
yg = iy*StRMtrx[5]+StRMtrx[3];
for (int ix = 0; ix < Nx; ix++) {
xg = ix*StRMtrx[2]+StRMtrx[0];
stack_result[offset+ix+Nx*iy+Nx*Ny*iz]=
stack_for_qds(Geoph, Ngeo, phase_use, xg, yg, zg, EnvMtrx, Mdata);
} // x-loop
} // y-loop
} // z-loop
MPI_Send(&offset, 1, MPI_INT, source, 2, MPI_COMM_WORLD);
MPI_Send(&stack_result[offset], chunkSize, MPI_FLOAT, source, 1, MPI_COMM_WORLD);
} // else
write("stackresult.dat", stack_result);
delete [] stack_result;
MPI_Finalize();
Thanks in advance for your patience.
You are calling write("stackresult.dat", stack_result); in all MPI ranks. As a result, they all write into and thus overwrite the same file and what you see is the content written by the last MPI process to execute that code statement. You should move the writing into the body of the if (rank == rootProcess) conditional so that only the root process will write.
As a side note, sending the value of the rank is redundant - MPI already assigns each process a rank that ranges from 0 to #processes - 1. That also makes sending of the offset redundant since each MPI process could easily compute the offset on its own based on its rank.
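A compact sketch of both suggestions (offsets derived from the rank, file written only by the root) could look like the following; the sizes are the question's 21x21x21 volume and a placeholder assignment stands in for stack_for_qds(). As in the question, ranks 1..numProcs-1 do the work, so the root's own slab is left untouched here.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

enum { Nx = 21, Ny = 21, Nz = 21 };

int main(int argc, char **argv)
{
    int rank, numProcs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);

    int rowsInZ   = Nz / numProcs;            /* z-slab per worker */
    int chunkSize = Nx * Ny * rowsInZ;
    float *stack_result = calloc((size_t)Nx * Ny * Nz, sizeof(float));

    if (rank != 0) {
        /* every worker derives its own offset from its rank,
           so no rank/offset messages are needed */
        int offset = (rank - 1) * chunkSize;
        for (int k = 0; k < chunkSize; k++)
            stack_result[offset + k] = (float)rank;   /* stands in for stack_for_qds() */
        MPI_Send(&stack_result[offset], chunkSize, MPI_FLOAT, 0, 1, MPI_COMM_WORLD);
    } else {
        for (int n = 1; n < numProcs; n++) {
            int offset = (n - 1) * chunkSize;
            MPI_Recv(&stack_result[offset], chunkSize, MPI_FLOAT,
                     n, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        /* only the root writes the assembled volume */
        FILE *f = fopen("stackresult.dat", "wb");
        if (f) {
            fwrite(stack_result, sizeof(float), (size_t)Nx * Ny * Nz, f);
            fclose(f);
        }
    }

    free(stack_result);
    MPI_Finalize();
    return 0;
}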

MPI parallel program to find prime numbers. Please help me debug

I wrote the following program to find prime numbers up to the #defined value N. It is a parallel program using MPI. Can anyone help me find the error in it? It compiles fine but crashes while executing.
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define N 65
int rank, size;
double start_time;
double end_time;
int y, x, i, port1, port2, port3;
int check =0; // prime flag: stays 0 throughout the calculation if the number is prime, and is set to 1 at some point if it is not
int signal =0; // no important use; just lets the master check that a slave's work is done
MPI_Status status;
MPI_Request request;
int main(int argc, char *argv[]){
MPI_Init(&argc, &argv); //initialize MPI operations
MPI_Comm_rank(MPI_COMM_WORLD, &rank); //get the rank
MPI_Comm_size(MPI_COMM_WORLD, &size); //get number of processes
if(rank == 0){ // master process divides work and also does initial work itself
start_time = MPI_Wtime();
printf("2\n"); //print prime number 2 first because the algorithm for finding the prime number in this program is just for odd number
port1 = (N/(size-1)); // calculating the suitable amount of work per process
for(i=1;i<size-1;i++){ // master sending the portion of work to each slave
port2 = port1 * i; // lower bound of work for i th process
port3 = ((i+1)*port1)-1; // upper bound of work for i th process
MPI_Isend(&port2, 1, MPI_INT, i, 100, MPI_COMM_WORLD, &request);
MPI_Isend(&port3, 1, MPI_INT, i, 101, MPI_COMM_WORLD, &request);
}
port2 = (size-1)*port1; port3= N; // the last process takes the remaining work
MPI_Isend(&port2, 1, MPI_INT, (size-1), 100, MPI_COMM_WORLD, &request);
MPI_Isend(&port3, 1, MPI_INT, (size-1), 101, MPI_COMM_WORLD, &request);
for(x = 3; x < port1; x=x+2){ // master doing initial work by itself
check = 0;
for(y = 3; y <= x/2; y=y+2){
if(x%y == 0) {check =1; break;}
}
if(check==0) printf("%d\n", x);
}
}
if (rank > 0){ // slave working part
MPI_Recv(&port2,1,MPI_INT, 0, 100, MPI_COMM_WORLD, &status);
MPI_Recv(&port3,1,MPI_INT, 0, 101, MPI_COMM_WORLD, &status);
if (port2%2 == 0) port2++; // change an even lower bound to odd, since no even number other than 2 is prime
for(x=port2; x<=port3; x=x+2){
check = 0;
for(y = 3; y <= x/2; y=y+2){
if(x%y == 0) {check =1; break;}
}
if (check==0) printf("%d\n",x);
}
signal= rank;
MPI_Isend(&signal, 1, MPI_INT, 0, 103, MPI_COMM_WORLD, &request); // just informing master that the work is finished
}
if (rank == 0){ // master concluding the work and printing the time taken to do the work
for(i = 1; i < size; i++){
MPI_Recv(&signal,1,MPI_INT, i, 103, MPI_COMM_WORLD, &status); // master confirming that all slaves finished their work
}
end_time = MPI_Wtime();
printf("\nRunning Time = %f \n\n", end_time - start_time);
}
MPI_Finalize();
return 0;
}
I got the following error:
mpirun -np 2 ./a.exe
Exception: STATUS_ACCESS_VIOLATION at eip=0051401C
End of stack trace
I found what was wrong with my program.
It was the use of the variable name signal, which collides with the standard C library function of the same name. Change the name of that variable (everywhere it is used) to any other valid name and it works.
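As an aside, the same search can be split without any master/slave bookkeeping by distributing the odd candidates cyclically over all ranks, including rank 0. A sketch (the output of the different ranks will interleave in arbitrary order):

#include <stdio.h>
#include <mpi.h>

#define N 65   /* search for primes up to N, as in the question */

/* trial division over odd divisors, same test as in the question */
static int is_prime(int x)
{
    for (int y = 3; y <= x / 2; y += 2)
        if (x % y == 0) return 0;
    return 1;
}

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double start = MPI_Wtime();
    if (rank == 0) printf("2\n");               /* the only even prime */

    /* cyclic distribution: rank r tests 3 + 2r, 3 + 2(r + size), ... */
    for (int x = 3 + 2 * rank; x <= N; x += 2 * size)
        if (is_prime(x)) printf("%d\n", x);

    MPI_Barrier(MPI_COMM_WORLD);                /* everyone has finished */
    if (rank == 0)
        printf("\nRunning Time = %f\n\n", MPI_Wtime() - start);

    MPI_Finalize();
    return 0;
}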

Will MPI_Send block if matching MPI_IRecv takes less data elements?

Assume the following MPI code.
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0){
MPI_Send(a, count, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
MPI_Send(b, count, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
}
else if (rank == 1){
MPI_Irecv(a, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req);
MPI_Recv(b, count, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
MPI_Wait(&req, &status);
}
Is it correct to say that the first MPI_Send(a, count, ...) will not block even though its matching MPI_Irecv(a, 1, ...) is only reading one element from the buffer?
Also, since no reads or writes are done to buffer a, is it correct that process 1 will not block even though MPI_Wait is not called directly after MPI_Irecv?
Thanks.
MPI_Send is a blocking call: it returns when the send/recv buffer can be safely read or modified by the calling application. No guarantees are made about the matching MPI_[I]recv call.
The MPI library does not know anything about the read/write status of the buffers in the application. The MPI standard calls for certain guarantees to be made by the application about the stability of the message buffers.
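To illustrate, one way to write the exchange from the question so that it cannot rely on internal buffering at all is to post both receives up front, with receive counts at least as large as the matching sends, and only then wait. A self-contained sketch with placeholder data:

#include <mpi.h>
#include <stdio.h>

#define COUNT 4

int main(int argc, char **argv)
{
    int rank, nprocs;
    char a[COUNT] = "abc", b[COUNT] = "xyz";

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    if (nprocs < 2) { MPI_Finalize(); return 0; }   /* needs two ranks */

    if (rank == 0) {
        MPI_Send(a, COUNT, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        MPI_Send(b, COUNT, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Request req[2];
        /* both receives are posted before any wait; the counts match the
           sends, so neither message can be truncated and neither send has
           to wait for a late receive (distinct tags used only for clarity) */
        MPI_Irecv(a, COUNT, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(b, COUNT, MPI_CHAR, 0, 1, MPI_COMM_WORLD, &req[1]);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        printf("rank 1 got \"%s\" and \"%s\"\n", a, b);
    }

    MPI_Finalize();
    return 0;
}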
