Is it safe to pass function pointers via MPI as a way of telling another node to call a function? Someone may say that passing any kind of pointer via MPI is meaningless, but I wrote some code to verify it.
// test.cpp
#include <cstdio>
#include <iostream>
#include <mpi.h>
#include <cstring>
using namespace std;

int f1(int a){ return a + 1; }
int f2(int a){ return a + 2; }
int f3(int a){ return a + 3; }

using F = int (*)(int);

int main(int argc, char *argv[]){
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Status state;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // test: rank 0 sends the raw bytes of a pointer to f2 to every other rank
    char data[10];
    if( 0 == rank ){
        *(reinterpret_cast<F*>(data)) = &f2;
        for(int i = 1; i < size; ++i)
            MPI_Send(data, 8, MPI_CHAR, i, 0, MPI_COMM_WORLD);
    }else{
        MPI_Recv(data, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &state);
        F* fp = reinterpret_cast<F*>(data);
        int ans = (**fp)(10);   // call through the received pointer bytes
        cout << ans << endl;
    }
    MPI_Finalize();
    return 0;
}
Here is the output:
12
12
12
12
12
12
12
12
12
I ran it via MVAPICH, and it works well. But I just don't know why, since separate address spaces should mean that a pointer value is useless in any process other than the one that generated it.
P.S. Here is my hostfile:
blade11:1
blade12:1
blade13:1
blade14:1
blade15:1
blade16:1
blade17:1
blade18:2
blade19:1
and I ran mpiexec -n 10 -f hostfile ./test; the code was compiled with C++11 enabled.
You are lucky in the sense that your cluster environment is homogeneous and no address space randomisation for ordinary executables is in place. As a consequence, all images are loaded at the same base address and laid out similarly in memory, hence functions have the same virtual addresses in all MPI ranks (note that this is rarely true for symbols from dynamically linked libraries as those are usually loaded at random addresses).
If you compile the source twice, using different compilers or the same compiler with different options, and then have some ranks run the first executable and the rest run the second one, the program will definitely crash.
Try this:
$ mpicxx -std=c++11 -O0 -o test_O0 test.cpp
$ mpicxx -std=c++11 -O2 -o test_O2 test.cpp
$ mpiexec -f hostfile -n 5 ./test_O0 : -n 5 ./test_O2
12
12
12
12
<crash>
The different levels of optimisation result in function code of different size in test_O0 and test_O2. Consequently, f2 will no longer have the same virtual address in all ranks. The ranks that run the same executable as rank 0 will print 12, while the rest will segfault.
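You can verify this homogeneity directly by printing the address of f2 in every rank. Here is a minimal sketch (a separate program, not part of the question's code):

// addr_check.cpp -- print f2's virtual address in every rank
#include <cstdio>
#include <mpi.h>

int f2(int a){ return a + 2; }

int main(int argc, char *argv[]){
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    // If every rank prints the same value, sending the pointer happens
    // to work; any rank printing a different value would crash calling it.
    // (Casting a function pointer to void* is not strictly portable,
    // but works on the platforms in question.)
    printf("rank %d: &f2 = %p\n", rank, (void*)&f2);
    MPI_Finalize();
    return 0;
}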
Is it safe to pass function pointers via MPI as a way of telling another node to call a function?
No, it is not. Address space is not shared among processes.
However, MPI processes which are the result of programs built from the same source can be organised to call a specific function when a certain message is received:
unsigned char data = 0;
MPI_Recv(&data, 1, MPI_UNSIGNED_CHAR, 0, 0, MPI_COMM_WORLD, &state);
if (data == 255) {
    f2(10); /* and so forth */
}
No.
However, there is a trick involving macros that maps a certain encoding of a function to a local function pointer/callback that can be recognised uniformly in all processes.
For example, this is used in HPX (http://stellar.cct.lsu.edu/files/hpx_0.9.5/html/HPX_PLAIN_ACTION.html) to run a function across inhomogeneous systems.
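Stripped of the macro layer, the core of such a trick is just a table that every process builds identically, so only an index ever travels over the wire. A minimal sketch (the names here are mine, not HPX's):

// Ship an index over MPI, not a pointer. The enum and the table must
// be compiled identically into every process (same source, same order).
#include <stdio.h>

typedef int (*Action)(int);

int f1(int a){ return a + 1; }
int f2(int a){ return a + 2; }
int f3(int a){ return a + 3; }

enum { ACT_F1, ACT_F2, ACT_F3 };                 /* what you would MPI_Send */
static const Action action_table[] = { f1, f2, f3 };

int main(void){
    int id = ACT_F2;                             /* as if received over the wire */
    printf("%d\n", action_table[id](10));        /* prints 12 in every process */
    return 0;
}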
I'm dealing with a routine which I want to be executed by the CPU the first time and by the GPU every other time. This routine contains the loop:
for (k = kb; k <= ke; k++){
    for (j = jb; j <= je; j++){
        for (i = ib; i <= ie; i++){
            ...
        }
    }
}
I tried adding #pragma acc loop collapse(3) to the loop, and #pragma acc routine(routine) vector just before the calls where I want the GPU to execute the routine. -Minfo=accel doesn't report any message, and with Nsight Systems I see that the routine is always executed by the CPU, so this approach doesn't work.
Why does the compiler read neither of the two #pragmas?
To follow on to Thomas' answer, here's an example of using the "if" clause:
% cat test.c
#include <stdlib.h>
#include <stdio.h>
void compute(int * Arr, int size, int use_gpu) {
#pragma acc parallel loop copyout(Arr[:size]) if(use_gpu)
   for (int i=0; i < size; ++i) {
      Arr[i] = i;
   }
}
int main() {
   int *Arr;
   int size;
   int use_gpu;
   size=1024;
   Arr = (int*) malloc(sizeof(int)*size);
   // Run on the host
   use_gpu=0;
   compute(Arr,size,use_gpu);
   // Run on the GPU
   use_gpu=1;
   compute(Arr,size,use_gpu);
   free(Arr);
}
% nvc -acc -Minfo=accel test.c
compute:
4, Generating copyout(Arr[:size]) [if not already present]
Generating NVIDIA GPU code
7, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
% setenv NV_ACC_TIME 1
% a.out
Accelerator Kernel Timing data
test.c
compute NVIDIA devicenum=0
time(us): 48
4: compute region reached 1 time
4: kernel launched 1 time
grid: [8] block: [128]
device time(us): total=5 max=5 min=5 avg=5
elapsed time(us): total=331 max=331 min=331 avg=331
4: data region reached 2 times
9: data copyout transfers: 1
device time(us): total=43 max=43 min=43 avg=43
I'm using nvc and set the compiler's runtime profiler (NV_ACC_TIME=1) to show that the kernel is launched only once.
You need to enable OpenACC processing: -acc (with NVHPC tools) or -fopenacc (with GCC), for example, and then you need to use an OpenACC compute construct (parallel, kernels) to actually launch parallel GPU execution (plus host/device memory management, as necessary). For example, you could call your routine from that compute construct, and the routine would annotate the loop nest with OpenACC loop directives, as you've mentioned, to actually make use of the GPU parallelism.
Then, to answer your actual question: OpenACC compute constructs support an if clause that selects whether the region executes on the current device ("GPU") or the local thread executes it ("CPU").
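Put together for the loop shape in the question, following the if-clause approach above, a minimal sketch could look like this (the array layout, bounds, and loop body are placeholders, not taken from the question):

// One routine; host vs GPU execution is selected per call via the if clause.
void routine(double *a, int n1, int n2, int n3, int use_gpu)
{
    #pragma acc parallel loop collapse(3) copy(a[0:n1*n2*n3]) if(use_gpu)
    for (int k = 0; k < n3; k++){
        for (int j = 0; j < n2; j++){
            for (int i = 0; i < n1; i++){
                a[(k*n2 + j)*n1 + i] += 1.0;   // stand-in for the real body
            }
        }
    }
}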
I have implemented a random sequence generator in Python and want to test the results with TestU01, but I don't understand how to give input to that test suite. Please also suggest how many bits of sequence I need to generate for the tests.
TestU01 is a library and doesn't come with executables. It mostly has methods to test C generators which implement unif01_Gen defined in unif01.h. See guideshorttest01.pdf.
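If you can express your generator as a C function returning 32 random bits, wrapping it for the batteries is short. Here is a minimal sketch using unif01_CreateExternGenBits (the xorshift generator below is just a stand-in for your own):

#include "unif01.h"
#include "bbattery.h"

/* Stand-in generator: any function returning 32 random bits works. */
static unsigned int xorshift32 (void)
{
    static unsigned int s = 2463534242u;
    s ^= s << 13; s ^= s >> 17; s ^= s << 5;
    return s;
}

int main (void)
{
    unif01_Gen *gen = unif01_CreateExternGenBits ("xorshift32", xorshift32);
    bbattery_SmallCrush (gen);
    unif01_DeleteExternGenBits (gen);
    return 0;
}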
However, TestU01 does come with a few batteries that test binary files directly. Here is a short program which calls them:
#include <stdio.h>
#include "gdef.h"
#include "swrite.h"
#include "bbattery.h"

int main (int argc, char *argv[])
{
    if (argc != 2) {
        printf("Specify a binary file of random bits as ./test <path>\n");
        return 0;
    }
    /* Determine the file size in bits. */
    FILE* fp = fopen(argv[1], "rb");
    fseek(fp, 0L, SEEK_END);
    size_t sz = (size_t)ftell(fp) * 8;
    fclose(fp);
    printf("Reading binary file %s of size %zu bits\n", argv[1], sz);
    swrite_Basic = FALSE;   /* suppress detailed output, keep the summaries */
    bbattery_RabbitFile (argv[1], sz);
    bbattery_AlphabitFile (argv[1], sz);
    bbattery_FIPS_140_2File (argv[1]);
    return 0;
}
After installing TestU01 (it's in the Arch/Manjaro AUR, in case that helps), compile it with: gcc test.c -o test -ltestu01
Here is a Python program which generates a random binary file. Note that the tests work on 32-bit blocks, and I suggest sticking to that when generating the file.
import struct
from random import Random

size = 1024*1024              # bytes to write
rand = Random()
with open("bits", "wb") as f:
    for i in range(size//4):
        value = rand.getrandbits(32)   # one 32-bit block
        s = struct.pack('I', value)
        f.write(s)
There is also a version of SmallCrush which reads a text file of about 5 million floats; see bbattery_SmallCrushFile. I haven't tried it, but make sure the floats are written with many digits, as the conversion to/from text can break the test.
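For instance, a sketch of writing such a text file from C, assuming one number per line is acceptable (I haven't verified the exact format bbattery_SmallCrushFile expects):

/* Write uniform doubles as text with enough digits that the text
   round-trip does not perturb the values. rand() is just a stand-in. */
#include <stdio.h>
#include <stdlib.h>

int main (void)
{
    FILE *f = fopen("floats.txt", "w");
    for (long i = 0; i < 5000000L; i++) {
        double u = rand() / (RAND_MAX + 1.0);
        fprintf(f, "%.17g\n", u);
    }
    fclose(f);
    return 0;
}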
I don't know much about the theory of testing RNGs, so I can't answer how long of a sequence you need. The TestU01 guide is detailed and might answer your questions.
I'm trying to insert one value into the third location in a host_vector using thrust.
static thrust::host_vector <int *> bins;
int * p;
bins.insert(3, 1, p);
But I am getting errors:
error: no instance of overloaded function "thrust::host_vector<T, Alloc>::insert [with T=int *, Alloc=std::allocator<int *>]" matches the argument list
argument types are: (int, int, int *)
object type is: thrust::host_vector<int *, std::allocator<int *>>
Has anyone seen this before, and how can I solve it? I want to use a vector to pass information into the GPU. I originally tried to use a vector of vectors to represent spatial cells that hold varying amounts of data, but learned that isn't possible with Thrust. So instead I'm using a single vector bins that holds my data sorted by spatial cell (the first 3 values might correspond to the first cell, the next 2 to the second cell, the next 0 to the third cell, etc.). The stored values are pointers to particles, and the number of particles in each spatial cell is not known before runtime.
As noted in comments, thrust::host_vector is modelled directly on std::vector and the operation you are trying to use requires an iterator for the position argument, which is why you get a compilation error. You can see this if you consult the relevant documentation:
http://en.cppreference.com/w/cpp/container/vector/insert
https://thrust.github.io/doc/classthrust_1_1host__vector.html#a9bb7c8e26ee8c10c5721b584081ae065
A complete working example of the code snippet you showed would look like this:
#include <iostream>
#include <thrust/host_vector.h>

int main()
{
    thrust::host_vector<int *> bins(10, reinterpret_cast<int *>(0));
    int * p = reinterpret_cast<int *>(0xdeadbeef);
    bins.insert(bins.begin()+3, 1, p);

    auto it = bins.begin();
    for(int i=0; it != bins.end(); ++it, i++) {
        int* v = *it;
        std::cout << i << " " << v << std::endl;
    }
    return 0;
}
Note that this requires that C++11 language features are enabled in nvcc (so use CUDA 8.0):
~/SO$ nvcc -std=c++11 -arch=sm_52 thrust_insert.cu
~/SO$ ./a.out
0 0
1 0
2 0
3 0xdeadbeef
4 0
5 0
6 0
7 0
8 0
9 0
10 0
There is a simple communication program that I used with MPICH2. When I execute the program using
mpiexec.exe -hosts 2 o00 o01 -noprompt mesajlasma.exe
the program starts but does not end. Using the resource monitor, I can see that it is still running on the host computer "o01". When I press CTRL+C, it is terminated, and then I can see that my program ran properly.
Why doesn't my program end? Where does it get stuck? Why does my program hang when I use MPI_Send and MPI_Recv?
Thanks in advance.
// mesajlasma.cpp
#include "stdafx.h"
#include "string.h"
#include "mpi.h"

int main(int argc, char* argv[])
{
    int nTasks, rank;
    char mesaj[20];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nTasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    //printf ("\nNumber of threads = %d, My rank = %d\n", nTasks, rank);

    if (rank == 1)
    {
        strcpy_s(mesaj, "Hello World");
        if (MPI_SUCCESS == MPI_Send(mesaj, strlen(mesaj)+1, MPI_CHAR, 0, 99, MPI_COMM_WORLD))
            printf("_OK!_\n");
    }
    if (rank == 0)
    {
        MPI_Recv(mesaj, 20, MPI_CHAR, 1, 99, MPI_COMM_WORLD, &status);
        printf("Received Message:%s\n", mesaj);
    }
    MPI_Finalize();
    return 0;
}
You probably also need to pass an argument like -n 2 to your mpiexec.exe command in order to instruct it to launch 2 processes. I believe that the -hosts argument is just an alternative way to specify the hosts on which your program can run, not how many processes will be created. With only one process launched, rank 0 blocks forever in MPI_Recv because there is no rank 1 to send the message, which matches the hang you are seeing.
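The combined invocation would then look something like this (assuming your mpiexec accepts -n together with -hosts; check mpiexec -help for your MPICH2 build):
mpiexec.exe -n 2 -hosts 2 o00 o01 -noprompt mesajlasma.exe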
I'm encountering a very strange problem: my 9800GT doesn't seem to calculate at all.
I've tried every hello-world I've found on the internet; here's one of them.
This program creates an array of 100 values on the host, sends it to the device, calculates the square of each value, returns it to the host, and prints the results.
#include "stdafx.h"
#include <stdio.h>
#include <cuda.h>
__global__ void square_array(float *a, int N)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx<N) a[idx] = a[idx] * a[idx];
}
// main routine that executes on the host
int main(void)
{
float *a_h, *a_d; // Pointer to host & device arrays
const int N = 100; // Number of elements in arrays
size_t size = N * sizeof(float);
a_h = (float *)malloc(size); // Allocate array on host
cudaMalloc((void **) &a_d, size); // Allocate array on device
// Initialize host array and copy it to CUDA device
for (int i=0; i<N; i++) a_h[i] = (float)i;
cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
// Do calculation on device:
int block_size = 4;
int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);
square_array <<< n_blocks, block_size >>> (a_d, N);
// Retrieve result from device and store it in host array
cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
// Print results
for (int i=0; i<N; i++) printf("%d %f\n", i, a_h[i]);
// Cleanup
free(a_h); cudaFree(a_d);
}
so the output is expected to be:
1 1.000
2 4.000
3 9.000
4 16.000
..
I swear that back in 2009 it worked perfectly (Vista 32, deviceemu).
Now I get this output:
1 1.000
2 2.000
3 3.000
4 4.000
so my card doesn't do anything. What can be the problem?
Configuration:
win7 x64
Visual Studio 2010 32-bit
CUDA toolkit 3.2 64-bit
Compilation settings: CUDA 3.2 toolkit, 32-bit target platform; deviceemu or not doesn't matter, the results are the same.
I also tried it in my VMware XP (32-bit) with Visual Studio 2008. The result is the same.
Please help me; I barely got the program to compile, and now I need it to work.
You can also view my project with everything it needs from my post on the NVIDIA forums (2.7 KB).
Thanks, Ilya
Your code produces the intended results on my Linux system, so I would suggest checking the error codes returned by cudaMalloc and cudaMemcpy to ensure there are no silent driver/runtime errors. For example:
cudaError_t error = cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
printf("error status: %s\n", cudaGetErrorString(error));
should print
error status: no error
if the call is successful.
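To avoid sprinkling that everywhere, you can wrap calls in a small macro; here is a minimal sketch (CUDA_CHECK is my name, not part of the CUDA runtime):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wraps any CUDA runtime call; prints the error and exits on failure.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err));                         \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage:
// CUDA_CHECK(cudaMalloc((void **)&a_d, size));
// CUDA_CHECK(cudaMemcpy(a_h, a_d, size, cudaMemcpyDeviceToHost));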
Also, I believe device emulation was deprecated in CUDA 3.0 and removed entirely in CUDA 3.1. I don't know if that's related to your problem though.
To compile several files, you'd just do something like this:
$nvcc -c foo.cu
$nvcc -c bar.cu
$nvcc -o foobar foo.o bar.o
Alternatively, you can do the linking in the last step with g++, like so:
$g++ -o foobar foo.o bar.o -L/usr/local/cuda/lib64 -lcudart
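One caveat, in case your .cu files call device functions across files: newer CUDA toolkits need relocatable device code for that, which (if I recall the flags correctly) looks like:
$nvcc -dc foo.cu
$nvcc -dc bar.cu
$nvcc -o foobar foo.o bar.o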