I have 4 filepaths in global_filetable and I am trying to scatter 2 filepaths to each process.
Process 0 gets its 2 paths correctly, but process 1 prints something strange: (null).
Here's the full code:
#include <stdio.h>
#include <stdlib.h> // malloc
#include <string.h> // strncpy
#include <limits.h> // PATH_MAX
#include <mpi.h>

int main(int argc, char *argv[])
{
    char** global_filetable = (char**)malloc(4 * PATH_MAX * sizeof(char));
    for(int i = 0; i < 4; ++i) {
        global_filetable[i] = (char*)malloc(PATH_MAX *sizeof(char));
        strncpy (global_filetable[i], "/path/", PATH_MAX);
    }
    /*for(int i = 0; i < 4; ++i) {
        printf("%s\n", global_filetable[i]);
    }*/
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    char** local_filetable = (char**)malloc(2 * PATH_MAX * sizeof(char));
    MPI_Scatter(global_filetable, 2*PATH_MAX, MPI_CHAR, local_filetable, 2*PATH_MAX , MPI_CHAR, 0, MPI_COMM_WORLD);
    /* now all processors print their local data: */
    for (int p = 0; p < size; ++p) {
        if (rank == p) {
            printf("Local process on rank %d is:\n", rank);
            for (int i = 0; i < 2; i++) {
                printf("path: %s\n", local_filetable[i]);
            }
        }
    }
    return 0;
}
Local process on rank 0 is:
path: /path/
path: /path/
Local process on rank 1 is:
path: (null)
path: (null)
Do you have any idea why I am having those nulls?
First, your allocation is inconsistent:
char** local_filetable = (char**)malloc(2 * PATH_MAX * sizeof(char));
The type char** indicates an array of char*, but you allocate a contiguous memory block, which would indicate a char*.
The easiest way would be to use the contiguous memory as char* for both global and local filetables. Depending on what get_filetable() actually does, you may have to convert. You can then index it like this:
char* entry = &filetable[i * PATH_MAX];
You can then simply scatter like this:
MPI_Scatter(global_filetable, 2 * PATH_MAX, MPI_CHAR,
local_filetable, 2 * PATH_MAX, MPI_CHAR, 0, MPI_COMM_WORLD);
Note that there is no more displacement, every rank just gets an equal sized chunk of the contiguous memory.
The next step would be to define a C and MPI struct encapsulating PATH_MAX characters so you can get rid of the constant usage of PATH_MAX and crude indexing.
I think this is much nicer (less complex, less memory management) than using actual char**. You would only need that if memory waste or redundant data transfer becomes an issue.
P.S. Make sure to never put more than PATH_MAX - 1 characters into a filetable entry, to keep space for the trailing \0.
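The contiguous layout described above can be sketched in plain C as follows. This is only an illustration under assumptions: the helper names filetable_alloc, filetable_entry and filetable_set are hypothetical and not part of the original code, and the PATH_MAX fallback value is an assumption for compilers that do not expose it through limits.h.

```c
#include <limits.h>   /* PATH_MAX on POSIX systems */
#include <stdio.h>    /* snprintf */
#include <stdlib.h>   /* calloc, free */
#include <string.h>   /* strcmp, for callers comparing entries */

#ifndef PATH_MAX
#define PATH_MAX 4096 /* fallback value (assumption) if limits.h lacks it */
#endif

/* Allocate n entries of PATH_MAX bytes each as ONE contiguous, zeroed block. */
char *filetable_alloc(size_t n)
{
    return calloc(n, PATH_MAX);
}

/* Entry i starts at byte offset i * PATH_MAX inside the block. */
char *filetable_entry(char *table, size_t i)
{
    return &table[i * PATH_MAX];
}

/* Copy path into entry i. snprintf never writes more than PATH_MAX bytes
   and always terminates with '\0'. Returns 0 on success, -1 if the path
   did not fit (more than PATH_MAX - 1 characters). */
int filetable_set(char *table, size_t i, const char *path)
{
    int n = snprintf(filetable_entry(table, i), PATH_MAX, "%s", path);
    return (n >= 0 && n < PATH_MAX) ? 0 : -1;
}
```

With a global table from filetable_alloc(4) on the root and a local table from filetable_alloc(2) on every rank, the MPI_Scatter call shown above with a count of 2 * PATH_MAX and MPI_CHAR would hand each rank two complete entries, because the bytes being sent are the strings themselves rather than pointers to them.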
Okay, I'm stupid.
char global_filetable[NUMBER_OF_STRINGS][PATH_MAX];
for(int i = 0; i < 4; ++i) {
    strcpy (global_filetable[i], "/path/");
}
char local_filetable[2][PATH_MAX];
Now it works!
I am very new to CUDA and I am trying to initialize an array on the device and return the result back to the host to print out to show if it was correctly initialized. I am doing this because the end goal is a dot product solution in which I multiply two arrays together, storing the results in another array and then summing up the entire thing so that I only need to return the host one value.
In the code I am working on, I am only trying to check whether I am initializing the array correctly. I want to create an array of size N following the pattern 1,2,3,4,5,6,7,8,1,2,3...
This is the code that I've written and it compiles without issue but when I run it the terminal is hanging and I have no clue why. Could someone help me out here? I'm so incredibly confused :\
#include <stdio.h>
#include <stdlib.h>
#include <chrono>

#define ARRAY_SIZE 100
#define BLOCK_SIZE 32

__global__ void cu_kernel (int *a_d,int *b_d,int *c_d, int size)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ int temp;
    if(temp != 8){
        a_d[x] = temp;
    } else {
        a_d[x] = temp;
        temp = 1;
    }
}

int main (int argc, char *argv[])
{
    //declare pointers for arrays
    int *a_d, *b_d, *c_d, *sum_h, *sum_d,a_h[ARRAY_SIZE];
    //set space for device variables
    cudaMalloc((void**) &a_d, sizeof(int) * ARRAY_SIZE);
    cudaMalloc((void**) &b_d, sizeof(int) * ARRAY_SIZE);
    cudaMalloc((void**) &c_d, sizeof(int) * ARRAY_SIZE);
    cudaMalloc((void**) &sum_d, sizeof(int));
    // set execution configuration
    dim3 dimblock (BLOCK_SIZE);
    dim3 dimgrid (ARRAY_SIZE/BLOCK_SIZE);
    // actual computation: call the kernel
    cu_kernel <<<dimgrid, dimblock>>> (a_d,b_d,c_d,ARRAY_SIZE);
    cudaError_t result;
    // transfer results back to host
    result = cudaMemcpy (a_h, a_d, sizeof(int) * ARRAY_SIZE, cudaMemcpyDeviceToHost);
    if (result != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed.");
    }
    // print reversed array
    printf ("Final state of the array:\n");
    for (int i =0; i < ARRAY_SIZE; i++) {
        printf ("%d ", a_h[i]);
    }
    printf ("\n");
}
There are at least 3 issues with your kernel code:
1. You are using the shared memory variable temp without initializing it.
2. You are not resolving the order in which threads access a shared variable, as discussed here.
3. You are imagining (perhaps) a particular order of thread execution, and CUDA provides no guarantees in that area.
The first item seems self-evident; however, naive methods to initialize it in a multi-threaded environment like CUDA are not going to work. Firstly, we again have the multi-threaded access pattern. Furthermore, in a multi-block scenario, shared memory in one block is logically distinct from shared memory in another block.
Rather than wrestle with mechanisms unsuited to create the pattern you desire, (informed by notions carried over from a serial processing environment), I would simply do something trivial like this to create the pattern you desire:
__global__ void cu_kernel (int *a_d,int *b_d,int *c_d, int size)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x < size) a_d[x] = (x&7) + 1;
}
Are there other ways to do it? Certainly.
__global__ void cu_kernel (int *a_d,int *b_d,int *c_d, int size)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ int temp;
    if (!threadIdx.x) temp = blockIdx.x*blockDim.x;
    __syncthreads(); // make thread 0's write to temp visible to the whole block
    if (x < size) a_d[x] = ((temp+threadIdx.x) & 7) + 1;
}
You can get as fancy as you like.
These changes will still leave a few values at zero at the end of the array, which would require changes to your grid sizing. There are many questions about this already, or study a sample code like vectorAdd.
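To make the grid sizing point concrete: with ARRAY_SIZE 100 and BLOCK_SIZE 32, the integer division ARRAY_SIZE/BLOCK_SIZE gives 3 blocks, so only 96 threads run and the last 4 elements are never written. A common remedy is to round the block count up and keep the x < size guard in the kernel. The following is a host-only C sketch of that arithmetic, not CUDA code; the function names pattern_value, blocks_needed and simulate_launch are hypothetical and introduced only for illustration.

```c
/* Value written at global index x: 1,2,...,8,1,2,... */
int pattern_value(int x)
{
    return (x & 7) + 1;
}

/* Number of blocks needed to cover n elements: ceiling of n / block_size. */
int blocks_needed(int n, int block_size)
{
    return (n + block_size - 1) / block_size;
}

/* Host-side stand-in for launching `blocks` blocks of `block_size` threads,
   where each thread is guarded by x < n. Returns how many elements of a
   were written. */
int simulate_launch(int *a, int n, int blocks, int block_size)
{
    int written = 0;
    for (int b = 0; b < blocks; ++b) {
        for (int t = 0; t < block_size; ++t) {
            int x = b * block_size + t;   /* same index math as the kernel */
            if (x < n) {
                a[x] = pattern_value(x);
                ++written;
            }
        }
    }
    return written;
}
```

In the CUDA program this would correspond to sizing the grid as dim3 dimgrid((ARRAY_SIZE + BLOCK_SIZE - 1) / BLOCK_SIZE); the extra threads in the last block are then filtered out by the x < size check.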
How can an array of structs that has been dynamically allocated on the host be used by a kernel, without passing the array of structs as a kernel argument? This seems like a common procedure with a good amount of documentation online, yet it doesn't work on the following program.
Note: the following questions were studied before posting this one:
1) copying host memory to cuda __device__ variable
2) Global variable in CUDA
3) Is there any way to dynamically allocate constant memory? CUDA
So far, unsuccessful attempts have been made to:
Dynamically allocate array of structs with cudaMalloc(), then
Use cudaMemcpyToSymbol() with the pointer returned from cudaMalloc() to copy to a __device__ variable which can be used by the kernel.
Code attempt:
NBody.cu (error checking using cudaStatus has mostly been omitted for better readability, and function to read data from file into dynamic array removed):
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <stdlib.h>

#define BLOCK 256

struct nbody {
    float x, y, vx, vy, m;
};
typedef struct nbody nbody;

// Global declarations
nbody* particle;

// Device variables
__device__ unsigned int d_N; // Kernel can successfully access this
__device__ nbody d_particle; // Update: part of problem was here with (*)

// Aim of kernel: to print contents of array of structs without using kernel argument
__global__ void step_cuda_v1() {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < d_N) {
        printf("%.f\n", d_particle.x);
    }
}

int main() {
    unsigned int N = 10;
    unsigned int I = 1;
    cudaMallocHost((void**)&particle, N * sizeof(nbody)); // Host allocation
    cudaError_t cudaStatus;
    for (int i = 0; i < N; i++) particle[i].x = i;
    nbody* particle_buf; // device buffer
    cudaMalloc((void**)&particle_buf, N * sizeof(nbody)); // Allocate device mem
    cudaMemcpy(particle_buf, particle, N * sizeof(nbody), cudaMemcpyHostToDevice); // Copy data into device mem
    cudaMemcpyToSymbol(d_particle, &particle_buf, sizeof(nbody*)); // Copy pointer to data into __device__ var
    cudaMemcpyToSymbol(d_N, &N, sizeof(unsigned int)); // This works fine
    int NThreadBlock = (N + BLOCK - 1) / BLOCK;
    for (int iteration = 0; iteration <= I; iteration++) {
        step_cuda_v1 << <NThreadBlock, BLOCK >> > ();
        //step_cuda_v1 << <1, 5 >> > (particle_buf);
        cudaStatus = cudaGetLastError();
        if (cudaStatus != cudaSuccess)
            fprintf(stderr, "ERROR: %s\n", cudaGetErrorString(cudaStatus));
    }
    return 0;
}
"ERROR: kernel launch failed."
How can I print the contents of the array of structs from the kernel, without passing it as a kernel argument?
Coding in C using VS2019 with CUDA 10.2
With the help of @Robert Crovella and @talonmies, here is the solution, which outputs a sequence that cycles from 0 to 9 repeatedly.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <stdlib.h>

#define BLOCK 256
//#include "Nbody.h"

struct nbody {
    float x, y, vx, vy, m;
};
typedef struct nbody nbody;

// Global declarations
nbody* particle;

// Device variables
__device__ unsigned int d_N; // Kernel can successfully access this
__device__ nbody* d_particle; // Pointer, so it can hold the address returned by cudaMalloc()
//__device__ nbody d_particle; // Update: part of problem was here with (*)

// Aim of kernel: to print contents of array of structs without using kernel argument
__global__ void step_cuda_v1() {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < d_N) {
        printf("%.f\n", d_particle[i].x);
    }
}

int main() {
    unsigned int N = 10;
    unsigned int I = 1;
    cudaMallocHost((void**)&particle, N * sizeof(nbody)); // Host allocation
    cudaError_t cudaStatus;
    for (int i = 0; i < N; i++) particle[i].x = i;
    nbody* particle_buf; // device buffer
    cudaMalloc((void**)&particle_buf, N * sizeof(nbody)); // Allocate device mem
    cudaMemcpy(particle_buf, particle, N * sizeof(nbody), cudaMemcpyHostToDevice); // Copy data into device mem
    cudaMemcpyToSymbol(d_particle, &particle_buf, sizeof(nbody*)); // Copy pointer to data into __device__ var
    cudaMemcpyToSymbol(d_N, &N, sizeof(unsigned int)); // This works fine
    int NThreadBlock = (N + BLOCK - 1) / BLOCK;
    for (int iteration = 0; iteration <= I; iteration++) {
        step_cuda_v1 << <NThreadBlock, BLOCK >> > ();
        //step_cuda_v1 << <1, 5 >> > (particle_buf);
        cudaDeviceSynchronize(); // wait for the kernel so its printf output is flushed
        cudaStatus = cudaGetLastError();
        if (cudaStatus != cudaSuccess)
            fprintf(stderr, "ERROR: %s\n", cudaGetErrorString(cudaStatus));
    }
    return 0;
}
I have a dynamically allocated array that is sent by rank 0 to the other ranks using MPI_Send().
On the receiving side, a dynamic array is allocated using malloc().
MPI_Recv() is then called on the other ranks. At this receive call, I get an invalid buffer pointer error.
The code is conceptually similar to this:
struct graph{
    int count;
    int * array;
} a_graph;

int x = 10;
MPI_Status status;
if (rank == 0){
    a_graph * my_graph = malloc(sizeof(my_graph));
    my_graph->count = x;
    my_graph->array = malloc(sizeof(int)*my_graph->count);
    for(int i =0; i < my_graph->count; i++)
        my_graph->array[i] = i;
}
else if (rank == 1){
    a_graph * my_graph = malloc(sizeof(my_graph));
    my_graph->count = x;
    my_graph->array = malloc(sizeof(int)*my_graph->count);
    MPI_Recv(my_graph->array,my_graph->count,int,0,0,MPI_COMM_WORLD,&status); // MPI INVALID BUFFER POINTER ERROR HAPPENS AT THIS RECV
}
I don't understand why this happens, since memory is allocated on both the sender and receiver ranks.
Below is a minimal, working, and verifiable example (MWVE), which Zulan suggested you make. Please provide an MWVE in your future questions. Anyway, you need to use the MPI datatype MPI_INT instead of int for sending and receiving.
#include <mpi.h>
#include <stdlib.h>
#include <stdio.h>

typedef struct graph{
    int count;
    int * array;
} a_graph;

int main()
{
    MPI_Init(NULL, NULL);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int x = 10;
    MPI_Status status;
    if (rank == 0){
        a_graph * my_graph = malloc(sizeof(a_graph));
        my_graph->count = x;
        my_graph->array = malloc(sizeof(int)*my_graph->count);
        for(int i =0; i < my_graph->count; i++)
            my_graph->array[i] = i;
        /* send the array to rank 1, using MPI_INT as the datatype */
        MPI_Send(my_graph->array, my_graph->count, MPI_INT, 1, 0, MPI_COMM_WORLD);
    }
    else if (rank == 1){
        a_graph * my_graph = malloc(sizeof(a_graph));
        my_graph->count = x;
        my_graph->array = malloc(sizeof(int)*my_graph->count);
        /* receive from rank 0, again with MPI_INT */
        MPI_Recv(my_graph->array, my_graph->count, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        for (int i=0; i<my_graph->count; ++i)
            printf("%i\n", my_graph->array[i]);
    }
    MPI_Finalize();
    return 0;
}
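One more detail worth noting: the example above allocates the struct with sizeof(a_graph), whereas the code in the question used sizeof(my_graph), which is only the size of a pointer and therefore smaller than the struct it is supposed to hold. A way to avoid this class of mistake is to take the size from the dereferenced pointer. The sketch below is plain C without MPI; the helper names graph_create and graph_destroy are hypothetical and not part of the original code.

```c
#include <stdlib.h>

typedef struct graph {
    int count;
    int *array;
} a_graph;

/* Allocate a graph with room for `count` ints.
   sizeof *g is the size of the struct itself, whatever the variable is
   called; sizeof(g) would only be the size of a pointer. */
a_graph *graph_create(int count)
{
    a_graph *g = malloc(sizeof *g);
    if (g == NULL)
        return NULL;
    g->count = count;
    g->array = malloc(sizeof *g->array * (size_t)count);
    if (g->array == NULL) {
        free(g);
        return NULL;
    }
    return g;
}

/* Release the array first, then the struct that owns it. */
void graph_destroy(a_graph *g)
{
    if (g != NULL) {
        free(g->array);
        free(g);
    }
}
```

The buffer passed to MPI_Send or MPI_Recv would then be g->array with a count of g->count and the datatype MPI_INT.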
I am trying to invert a PGM image using MPI. The grayscale (PGM) image should be loaded on the root processor and then be sent to each of the s^2 processors. Each processor will invert a block of the given image, and the inverted blocks will be gathered back on the root processor, which will assemble the blocks into the final image and write it to a PGM image. I ran the following code, but did not get any output. The image was read after running the code, but there was no indication of writing the resultant image. Could you please let me know what could be wrong with it?
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <time.h>
#include <string.h>
#include <math.h>
#include <memory.h>
#define max(x, y) ((x>y) ? (x):(y))
#define min(x, y) ((x<y) ? (x):(y))
int xdim;
int ydim;
int maxraw;
unsigned char *image;
void ReadPGM(FILE*);
void WritePGM(FILE*);
#define s 2
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int p, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    const int NPROWS=s; /* number of rows in _decomposition_ */
    const int NPCOLS=s; /* number of cols in _decomposition_ */
    const int BLOCKROWS = xdim/NPROWS; /* number of rows in _block_ */
    const int BLOCKCOLS = ydim/NPCOLS; /* number of cols in _block_ */
    int i, j;
    FILE *fp;
    for (int ii=0; ii<BLOCKROWS*BLOCKCOLS; ii++)
        BLimage[ii] = 0;
    for (int ii=0; ii<BLOCKROWS*BLOCKCOLS; ii++)
        BLfilteredMat[ii] = 0;
    if (rank == 0) {
        /* begin reading PGM.... */
    }
    MPI_Datatype blocktype;
    MPI_Datatype blocktype2;
    MPI_Type_vector(BLOCKROWS, BLOCKCOLS, ydim, MPI_FLOAT, &blocktype2);
    MPI_Type_create_resized( blocktype2, 0, sizeof(float), &blocktype);
    int disps[NPROWS*NPCOLS];
    int counts[NPROWS*NPCOLS];
    for (int ii=0; ii<NPROWS; ii++) {
        for (int jj=0; jj<NPCOLS; jj++) {
            disps[ii*NPCOLS+jj] = ii*ydim*BLOCKROWS+jj*BLOCKCOLS;
            counts [ii*NPCOLS+jj] = 1;
        }
    }
    MPI_Scatterv(image, counts, disps, blocktype, BLimage, BLOCKROWS*BLOCKCOLS, MPI_FLOAT, 0, MPI_COMM_WORLD);
    //************** Invert the block **************//
    for (int proc=0; proc<p; proc++) {
        if (proc == rank) {
            for (int j = 0; j < BLOCKCOLS; j++) {
                for (int i = 0; i < BLOCKROWS; i++) {
                    BLfilteredMat[j*BLOCKROWS+i] = 255 - image[j*BLOCKROWS+i];
                }
            }
        } // close if (proc == rank) {
    } // close for (int proc=0; proc<p; proc++) {
    MPI_Gatherv(BLfilteredMat, BLOCKROWS*BLOCKCOLS,MPI_FLOAT, image, counts, disps,blocktype, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        /* Begin writing PGM.... */
    }
    return (1);
}
It is very likely MPI is not the right tool for the job. The reason for this is that your job is inherently bandwidth limited.
Think of it this way: You have a coloring book with images which you all want to color in.
Method 1: you take your time and color them in one by one.
Method 2: you copy each page to a new sheet of paper and mail it to a friend who then colors it in for you. He mails it back to you and in the end you glue all the pages you received from all of your friends together to make one colored-in book.
Note that method two involves copying the whole book, which is arguably the same amount of work needed to color in the whole book. So method two is less time-efficient without even considering the overhead of shoving the pages into an envelope, licking the stamp, going to the post office and waiting for the letter to be delivered.
If you look at your code, every transmitted byte is only touched once throughout the whole program in this line:
BLfilteredMat[j*BLOCKROWS+i] = 255 - image[j*BLOCKROWS+i];
A single processor is much faster at subtracting two integers than it is at sending an integer over the wire; therefore I must advise against using MPI for your particular problem.
My suggestion to solve your problem: Try to avoid unnecessary communication whenever possible. Do all processes have access to the file system on which the files are located? You could try reading them directly from the filesystem.
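For comparison, the whole inversion done by a single process is one pass over the pixel buffer with one subtraction per byte and no communication at all. The following is a minimal sketch, assuming 8-bit pixels stored in an unsigned char array as in the question; the function name invert_image is hypothetical, and passing 255 as the maximum gray value is an assumption matching the constant used in the question.

```c
#include <stddef.h>

/* Invert an 8-bit grayscale image in place: each pixel p becomes maxval - p.
   Pixels are assumed to lie in the range 0..maxval. */
void invert_image(unsigned char *pixels, size_t count, unsigned char maxval)
{
    for (size_t i = 0; i < count; ++i)
        pixels[i] = (unsigned char)(maxval - pixels[i]);
}
```

For the image in the question this would be called once as invert_image(image, (size_t)xdim * ydim, 255) after reading the file and before writing it back.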
I have implemented a 2D array MPI scatter which works well: the master processor can scatter 2D parts of the initial big array. The problem is that when I use a dynamically allocated 2D image file as input, it doesn't work. I suppose there must be something wrong with the memory. Is there any way of obtaining 2D parts of a big, dynamically allocated 2D array?
I had a similar problem, but with a dynamically allocated one-dimensional vector.
I solved it as follows:
#include <stdio.h>
#include <stdlib.h> /* malloc, free */
#include "mpi.h"

int main(int argc, char** argv) {
    /* .......Variables Initialisation ......*/
    int Numprocs, MyRank, Root = 0;
    int index;
    int *InputBuffer = NULL, *RecvBuffer;
    int Scatter_DataSize;
    int DataSize;
    MPI_Status status;

    /* ........MPI Initialisation .......*/
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &MyRank);
    MPI_Comm_size(MPI_COMM_WORLD, &Numprocs);

    if (MyRank == Root) {
        DataSize = 80000;
        /* ...Allocate memory.....*/
        InputBuffer = (int*) malloc(DataSize * sizeof(int));
        for (index = 0; index < DataSize; index++)
            InputBuffer[index] = index;
    }

    /* every rank needs DataSize to size its receive buffer */
    MPI_Bcast(&DataSize, 1, MPI_INT, Root, MPI_COMM_WORLD);
    if (DataSize % Numprocs != 0) {
        if (MyRank == Root)
            printf("Input is not evenly divisible by Number of Processes\n");
        MPI_Finalize();
        return 1;
    }

    Scatter_DataSize = DataSize / Numprocs;
    RecvBuffer = (int *) malloc(Scatter_DataSize * sizeof(int));
    MPI_Scatter(InputBuffer, Scatter_DataSize, MPI_INT, RecvBuffer,
                Scatter_DataSize, MPI_INT, Root, MPI_COMM_WORLD);

    for (index = 0; index < Scatter_DataSize; ++index)
        printf("MyRank = %d, RecvBuffer[%d] = %d \n", MyRank, index,
               RecvBuffer[index]);

    free(RecvBuffer);
    free(InputBuffer); /* NULL on non-root ranks, which free() accepts */
    MPI_Finalize();
    return 0;
}
This link has examples that have helped me:
Hope this helps.