PGCC-S-0000 internal error reported for _mp_malloc although there are no heap allocations - OpenACC

When I try to compile my OpenACC code, the compiler reports:
PGCC-S-0000-Internal compiler error. Call in OpenACC region to support routine - _mp_malloc (/home/lisanhu/mine/ws/C/AccSeqC/as_align.cc: 92)
PGCC-S-0155-Compiler failed to translate accelerator region (see -Minfo messages) (/home/lisanhu/mine/ws/C/AccSeqC/as_align.cc: 92)
inexact_dfs_iter_search(const char *, long, long, const long *, const long *, long, const char *, const long *, const char *, acc_range *, int):
92, Generating acc routine seq
93, Accelerator restriction: unsupported call to support routine '_mp_malloc'
The reported function is defined as follows:
int
inexact_dfs_iter_search(const char *query, const array_size q_length, array_size allowed_diffs,
                        const array_size *c, const array_size *o, array_size r_length,
                        const char *ref_bwt, const array_size *rev_o, const char *rev_bwt,
                        Range *res, int num_of_res) {
    array_size d[q_length];
    calculateD(query, q_length, r_length, c, rev_o, rev_bwt, d);
    // for (int i = 0; i < q_length; ++i) {
    //     cout << d[i];
    // }
    // cout << endl;
    // cout << strndup(query, q_length) << endl;

    Profile p{q_length - 1, 1, r_length - 1, allowed_diffs};
    int prof_size = 9 * q_length + 1;
    Profile profs[prof_size];
    Stack<Profile> profiles(profs, prof_size);
    profiles.push(p);
    Heap<Range> results(res, num_of_res);

    while (!profiles.empty() && !results.full()) {
        if (profiles.full()) {
            return 1;
        }
        // p = profiles.peek();
        p = profiles.pop();
        inex_dfs_process_profile(query, p, c, o, ref_bwt, d, profiles, results);
    }
    return 0;
}
Line 92 is the 5th line of the snippet (it's part of the function definition, which is strange).
I'd really appreciate it if someone could help me with this.

I've found the reason for this. For array_size d[q_length];, the compiler actually calls _mp_malloc to allocate the variable-length array's storage. I replaced q_length with a constant and it compiles fine. (This is only a temporary tweak, but I finally found the reason.)
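For reference, a minimal sketch of that workaround (MAX_Q_LENGTH is a hypothetical upper bound I am introducing here, not a name from the original code):

constexpr array_size MAX_Q_LENGTH = 256;   // hypothetical compile-time bound on q_length

// inside inexact_dfs_iter_search, instead of the VLA "array_size d[q_length];"
// use a fixed-size buffer, so no _mp_malloc call is generated:
array_size d[MAX_Q_LENGTH];                // assumes q_length <= MAX_Q_LENGTH
calculateD(query, q_length, r_length, c, rev_o, rev_bwt, d);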

Related

Generate functions at compile time

I have an image. Every pixel contains information about RGB intensity. Now I want to sum the intensity of these channels, but I also want to choose which channels' intensity to sum. A straightforward implementation of this would look like this:
int intensity(const unsigned char* pixel, bool red, bool green, bool blue){
    return 0 + (red ? pixel[0] : 0) + (green ? pixel[1] : 0) + (blue ? pixel[2] : 0);
}
Because I will call this function for every pixel in the image, I want to discard all the conditions if I can. So I guess I have to have a function for every case:
std::function<int(const unsigned char* pixel)> generateIntensityAccumulator(
    const bool& accumulateRChannel,
    const bool& accumulateGChannel,
    const bool& accumulateBChannel)
{
    if (accumulateRChannel && accumulateGChannel && accumulateBChannel){
        return [](const unsigned char* pixel){
            return static_cast<int>(pixel[0]) + static_cast<int>(pixel[1]) + static_cast<int>(pixel[2]);
        };
    }
    if (!accumulateRChannel && accumulateGChannel && accumulateBChannel){
        return [](const unsigned char* pixel){
            return static_cast<int>(pixel[1]) + static_cast<int>(pixel[2]);
        };
    }
    if (!accumulateRChannel && !accumulateGChannel && accumulateBChannel){
        return [](const unsigned char* pixel){
            return static_cast<int>(pixel[2]);
        };
    }
    if (!accumulateRChannel && !accumulateGChannel && !accumulateBChannel){
        return [](const unsigned char* pixel){
            return 0;
        };
    }
    if (accumulateRChannel && !accumulateGChannel && !accumulateBChannel){
        return [](const unsigned char* pixel){
            return static_cast<int>(pixel[0]);
        };
    }
    if (!accumulateRChannel && accumulateGChannel && !accumulateBChannel){
        return [](const unsigned char* pixel){
            return static_cast<int>(pixel[1]);
        };
    }
    if (accumulateRChannel && !accumulateGChannel && accumulateBChannel){
        return [](const unsigned char* pixel){
            return static_cast<int>(pixel[0]) + static_cast<int>(pixel[2]);
        };
    }
    if (accumulateRChannel && accumulateGChannel && !accumulateBChannel){
        return [](const unsigned char* pixel){
            return static_cast<int>(pixel[0]) + static_cast<int>(pixel[1]);
        };
    }
}
Now I can use this generator before entering the image loop and use the function without any conditions:
...
auto accumulator = generateIntensityAccumulator(true, false, true);
for(auto pixel : pixels){
    auto intensity = accumulator(pixel);
}
...
But it is a lot of writing for such a simple task, and I have a feeling there is a better way to accomplish this: for example, make the compiler do the dirty work for me and generate all of the above cases. Can someone point me in the right direction?
Using a std::function like this will cost you dearly, because you don't give the compiler a chance to optimize by inlining what it can.
What you are trying to do is a good job for templates. And since you use integral numbers, the expression itself may be optimized away, sparing you the need to write a specialization for each version. Look at this example:
#include <array>
#include <chrono>
#include <iostream>
#include <random>
#include <vector>

template <bool AccumulateR, bool AccumulateG, bool AccumulateB>
inline int accumulate(const unsigned char *pixel) {
    static constexpr int enableR = static_cast<int>(AccumulateR);
    static constexpr int enableG = static_cast<int>(AccumulateG);
    static constexpr int enableB = static_cast<int>(AccumulateB);
    return enableR * static_cast<int>(pixel[0]) +
           enableG * static_cast<int>(pixel[1]) +
           enableB * static_cast<int>(pixel[2]);
}

int main(void) {
    std::vector<std::array<unsigned char, 3>> pixels(
        1e7, std::array<unsigned char, 3>{0, 0, 0});

    // Fill up with randomness
    // (char types are not valid for uniform_int_distribution; draw ints and cast)
    std::random_device rd;
    std::uniform_int_distribution<int> dist(0, 255);
    for (auto &pixel : pixels) {
        pixel[0] = static_cast<unsigned char>(dist(rd));
        pixel[1] = static_cast<unsigned char>(dist(rd));
        pixel[2] = static_cast<unsigned char>(dist(rd));
    }

    // Measure perf
    using namespace std::chrono;
    auto t1 = high_resolution_clock::now();
    int sum1 = 0;
    for (auto const &pixel : pixels)
        sum1 += accumulate<true, true, true>(pixel.data());
    auto t2 = high_resolution_clock::now();
    int sum2 = 0;
    for (auto const &pixel : pixels)
        sum2 += accumulate<false, true, false>(pixel.data());
    auto t3 = high_resolution_clock::now();

    std::cout << "Sum 1 " << sum1 << " in "
              << duration_cast<milliseconds>(t2 - t1).count() << "ms\n";
    std::cout << "Sum 2 " << sum2 << " in "
              << duration_cast<milliseconds>(t3 - t2).count() << "ms\n";
}
Compiled with Clang 3.9 at -O2, this yields the following result on my CPU:
Sum 1 -470682949 in 7ms
Sum 2 1275037960 in 2ms
Please notice that we have an overflow here; you may need to use something bigger than an int (a uint64_t might do). If you inspect the assembly, you will see that the two versions of the function are inlined and optimized differently.
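A minimal fix for the overflow would be to widen the accumulators (my tweak, not part of the original listing):

#include <cstdint>
// ...
int64_t sum1 = 0;   // 1e7 pixels * up to 765 each overflows a 32-bit int
int64_t sum2 = 0;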
First things first. Don't write a std::function that takes a single pixel; write one that takes a contiguous range of pixels (a scanline of pixels).
Second, you want to write a template version of intensity:
template<bool red, bool green, bool blue>
int intensity(const unsigned char* pixel){
    return (red ? pixel[0] : 0) + (green ? pixel[1] : 0) + (blue ? pixel[2] : 0);
}
pretty simple, eh? That will optimize down to your hand-crafted version.
template<std::size_t index>
int intensity(const unsigned char* pixel){
    // the != 0 comparisons avoid a narrowing error when converting to the bool parameters
    return intensity< (index&1) != 0, (index&2) != 0, (index&4) != 0 >(pixel);
}
this one maps the bits of index onto the right intensity<bool, bool, bool> to call. Now for the scanline version:
template<std::size_t index, std::size_t pixel_stride=3>
int sum_intensity(const unsigned char* pixel, std::size_t count){
    int value = 0;
    while(count--) {
        value += intensity<index>(pixel);
        pixel += pixel_stride;
    }
    return value;
}
We can now generate our scanline intensity calculator:
using scanline_fn = int(*)(const unsigned char* pixel, std::size_t count);

scanline_fn scanline_intensity(bool red, bool green, bool blue) {
    static const scanline_fn table[] = {
        sum_intensity<0b000>, sum_intensity<0b001>,
        sum_intensity<0b010>, sum_intensity<0b011>,
        sum_intensity<0b100>, sum_intensity<0b101>,
        sum_intensity<0b110>, sum_intensity<0b111>,
    };
    std::size_t index = red + green*2 + blue*4;
    return table[index];
}
and done.
These techniques can be made generic, but you don't need the generic ones.
If your pixel stride is not 3 (say there is an alpha channel), sum_intensity needs to be passed it (ideally as a template parameter), as in the sketch below.
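For instance, for RGBA data (stride 4) one might instantiate it like this (the call itself is my illustration, using the names above):

// red+green (index 0b011) over 4-byte RGBA pixels:
int sum = sum_intensity<0b011, 4>(scanline, pixel_count);
// the dispatch table in scanline_intensity would likewise need the stride,
// either as a template parameter of its own or as one table per stride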

Is there a way to help the auto-vectorizing compiler emit saturation arithmetic intrinsics in LLVM?

I have a few for loops that do saturated arithmetic operations. For instance, the implementation of saturated add in my case is as follows:
static void addsat(Vector &R, Vector &A, Vector &B)
{
    int32_t a, b, r;
    int32_t max_add;
    int32_t min_add;
    const int32_t SAT_VALUE = (1<<(16-1))-1;
    const int32_t SAT_VALUE2 = (-SAT_VALUE - 1);
    const int32_t sat_cond = (SAT_VALUE <= 0x7fffffff);
    const uint32_t SAT = 0xffffffff >> 16;

    for (int i=0; i<R.length; i++)
    {
        a = static_cast<uint32_t>(A.data[i]);
        b = static_cast<uint32_t>(B.data[i]);
        max_add = (int32_t)0x7fffffff - a;
        min_add = (int32_t)0x80000000 - a;
        r = (a>0 && b>max_add) ? 0x7fffffff : a + b;   // clamp positive overflow
        r = (a<0 && b<min_add) ? 0x80000000 : r;       // clamp negative overflow
        if ( sat_cond == 1)
        {
            std_max(r,r,SAT_VALUE2);
            std_min(r,r,SAT_VALUE);
        }
        else
        {
            r = static_cast<uint16_t> (static_cast<int32_t> (r));
        }
        R.data[i] = static_cast<uint16_t>(r);
    }
}
I see that there is a saturating-add instruction family (paddsw) on x86 that could be the perfect match for this loop. The code does get auto-vectorized, but into a combination of multiple operations mirroring my source. I would like to know the best way to write this loop so that the auto-vectorizer recognizes the saturated-add pattern.
The Vector structure is:
struct V {
    static constexpr int length = 32;
    unsigned short data[32];
};
The compiler is clang 3.8, and the code was compiled for the AVX2 Haswell x86-64 architecture.
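For what it's worth, the widen-add-clamp formulation is the idiom auto-vectorizers most commonly recognize. A minimal sketch of that rewrite (my suggestion, assuming the 16-bit lanes are meant to saturate to the int16_t range; whether clang 3.8 already pattern-matches this into paddsw is not guaranteed):

#include <algorithm>
#include <cstdint>

struct Vector {                        // the V struct from the question
    static constexpr int length = 32;
    unsigned short data[32];
};

static void addsat_clamp(Vector &R, const Vector &A, const Vector &B)
{
    for (int i = 0; i < Vector::length; i++) {
        // reinterpret each 16-bit lane as signed, widen to 32 bits, add...
        int32_t sum = static_cast<int32_t>(static_cast<int16_t>(A.data[i]))
                    + static_cast<int32_t>(static_cast<int16_t>(B.data[i]));
        // ...then clamp to the int16_t range: this min/max pair is the
        // pattern the vectorizer looks for
        sum = std::min(std::max(sum, -32768), 32767);
        R.data[i] = static_cast<uint16_t>(sum);
    }
}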

Is there a memset-like function which can set an integer value in Visual Studio?

1. It is a pity that memset(void* dst, int value, size_t size) fools a lot of people when they first use it: the second parameter int value really behaves as uchar value, which would describe the actual operation inside. Don't misunderstand me, I am asking for a memset-like function!
2. I know there are C++ candy functions like std::fill_n(my_array, array_length, constant_value); and even a pure C function on OS X, memset_pattern4(grid, &pattern, sizeof grid);, mentioned in the perfect thread Why is memset() incorrectly initializing int?.
3. For those asking why I wouldn't use a for loop to set integer by integer: memset gives better performance when setting big chunks (10K?) of memory, at least on x86. http://www.gamedev.net/topic/472631-performance-of-memset/page-2 gives more discussion, although without a conclusion (I doubt there will be one).
4. Said function can be used to simplify counting sort by avoiding the useless prefix-sum ("fibonacci") accumulation.
Original:
for (int i = 0; i < SRC_ARRY_SIZE; i++)
    counter_arry[src_arry[i]]++;
for (int i = SRC_LOW_BOUND; i < SRC_HI_BOUND; i++)   // forward prefix sum
    counter_arry[i+1] += counter_arry[i];
for (int i = 0; i < SRC_ARRY_SIZE; i++)
{
    value = src_arry[i];
    map = --counter_arry[value];   // then count down!
    temp[map] = value;
}
Expected:
for (int i = 0; i < SRC_ARRY_SIZE; i++)
    counter_arry[src_arry[i]]++;
for (int i = SRC_LOW_BOUND; i < SRC_HI_BOUND+1; i++)   // no prefix sum needed
{
    memset_4(cur_ptr, i, counter_arry[i]);   // memset_4 is the wished-for function
    cur_ptr += counter_arry[i];
}
Thanks for your kindly review and reply!
Here's an implementation of memset_pattern4() that you might find useful. It's nothing like Darwin's SSE assembly language version, except that it has the same interface.
#include <string.h>
#include <stdint.h>
/*
A portable implementation of the Darwin memset_patternX() family of functions:
These are analogous to memset(), except that they fill memory with a replicated
pattern either 4, 8, or 16 bytes long. b points to a buffer of size len bytes
which is to be filled. The second parameter points to the pattern. If the
buffer length is not an even multiple of the pattern length, the last instance
of the pattern will be truncated. Neither the buffer nor the pattern pointer
need be aligned.
*/
/*
alignment utility macros stolen from Linux
see https://lkml.org/lkml/2006/11/25/2 for a discussion of why typeof() is used
*/
#if !_MSC_VER
#define __ALIGN_KERNEL(x, a) __ALIGN_KERNEL_MASK(x, (typeof(x))(a) - 1)
#define __ALIGN_KERNEL_MASK(x, mask) (((x) + (mask)) & ~(mask))
#define ALIGN(x, a) __ALIGN_KERNEL((x), (a))
#define __ALIGN_MASK(x, mask) __ALIGN_KERNEL_MASK((x), (mask))
#define PTR_ALIGN(p, a) ((typeof(p))ALIGN((uintptr_t)(p), (a)))
#define IS_ALIGNED(x, a) (((x) & ((typeof(x))(a) - 1)) == 0)
#define IS_PTR_ALIGNED(p, a) (IS_ALIGNED((uintptr_t)(p), (a)))
#else
/* MS friendly versions */
/* taken from the DDK's fltKernel.h header */
#define IS_ALIGNED(_pointer, _alignment) \
((((uintptr_t) (_pointer)) & ((_alignment) - 1)) == 0)
#define ROUND_TO_SIZE(_length, _alignment) \
((((uintptr_t)(_length)) + ((_alignment)-1)) & ~(uintptr_t) ((_alignment) - 1))
#define __ALIGN_KERNEL(x, a) ROUND_TO_SIZE( (x), (a))
#define ALIGN(x, a) __ALIGN_KERNEL((x), (a))
#define PTR_ALIGN(p, a) ALIGN((p), (a))
#define IS_PTR_ALIGNED(p, a) (IS_ALIGNED((uintptr_t)(p), (a)))
#endif
void nx_memset_pattern4(void *b, const void *pattern4, size_t len)
{
    enum { pattern_len = 4 };
    unsigned char* dst = (unsigned char*) b;
    unsigned const char* src = (unsigned const char*) pattern4;

    if (IS_PTR_ALIGNED( dst, pattern_len) && IS_PTR_ALIGNED( src, pattern_len)) {
        /* handle aligned moves */
        uint32_t val = *((uint32_t*)src);
        uint32_t* word_dst = (uint32_t*) dst;
        size_t word_count = len / pattern_len;
        dst += word_count * pattern_len;
        len -= word_count * pattern_len;
        for (; word_count != 0; --word_count) {
            *word_dst++ = val;
        }
    }
    else {
        while (pattern_len <= len) {
            memcpy(dst, src, pattern_len);
            dst += pattern_len;
            len -= pattern_len;
        }
    }
    /* copy any truncated tail of the pattern */
    memcpy( dst, src, len);
}
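For example, this could stand in for the memset_4 wished for in point 4 of the question; a small usage sketch:

#include <stdio.h>

int main(void)
{
    int grid[8];
    int pattern = 42;
    /* fill the buffer with the repeated 4-byte pattern, i.e. set each int to 42 */
    nx_memset_pattern4(grid, &pattern, sizeof grid);
    for (int i = 0; i < 8; i++)
        printf("%d ", grid[i]);   /* prints: 42 42 42 42 42 42 42 42 */
    printf("\n");
    return 0;
}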

Piecemeal processing of a matrix - CUDA

OK, so let's say I have an (N x N) matrix that I would like to process. This matrix is quite large for my computer; if I try to send it to the device all at once I get an 'out of memory' error.
So is there a way to send sections of the matrix to the device? One way I can see to do it is to copy portions of the matrix on the host, send these manageable portions to the device one at a time, and then put the results back together at the end.
Here is something I have tried, but the cudaMemcpy in the for loop returns error code 11, 'invalid argument.'
int h_N = 10000;
size_t h_size_m = h_N*sizeof(float);
h_A = (float*)malloc(h_size_m*h_size_m);
int d_N = 2500;
size_t d_size_m = d_N*sizeof(float);
InitializeMatrices(h_N);
int i;
int iterations = (h_N*h_N)/(d_N*d_N);
for( i = 0; i < iterations; i++ )
{
    float* h_array_ref = h_A+(i*d_N*d_N);
    cudasafe( cudaMemcpy(d_A, h_array_ref, d_size_m*d_size_m, cudaMemcpyHostToDevice), "cudaMemcpy");
    cudasafe( cudaFree(d_A), "cudaFree(d_A)" );
}
What I'm trying to accomplish with the above code is this: instead of sending the entire matrix to the device, I simply send a pointer to a place within that matrix, reserve enough space on the device to do the work, and then with the next iteration of the loop move the pointer forward within the matrix, and so on.
Not only can you do this (assuming your problem decomposes easily into sub-arrays this way), it can be a very useful thing to do for performance; once you get the basic approach you've described working, you can start using asynchronous memory copies and double-buffering to overlap some of the memory-transfer time with the time spent computing what is already on-card (see the sketch at the end of this answer).
But first let's get the simple thing working. Below is a 1d example (multiplying a vector by a scalar and adding another scalar), but using a linearized 2d array would be the same; the key part is:
CHK_CUDA( cudaMalloc(&xd, batchsize*sizeof(float)) );
CHK_CUDA( cudaMalloc(&yd, batchsize*sizeof(float)) );
tick(&gputimer);
int nbatches = 0;
for (int nstart=0; nstart < n; nstart+=batchsize) {
    int size = batchsize;
    if ((nstart + batchsize) > n) size = n - nstart;

    CHK_CUDA( cudaMemcpy(xd, &(x[nstart]), size*sizeof(float), cudaMemcpyHostToDevice) );
    blocksize = (size+nblocks-1)/nblocks;
    cuda_saxpb<<<nblocks, blocksize>>>(xd, a, b, yd, size);
    CHK_CUDA( cudaMemcpy(&(ycuda[nstart]), yd, size*sizeof(float), cudaMemcpyDeviceToHost) );

    nbatches++;
}
gputime = tock(&gputimer);
CHK_CUDA( cudaFree(xd) );
CHK_CUDA( cudaFree(yd) );
You allocate the buffers at the start, and then loop through until you're done, each time doing the copy, starting the kernel, and then copying back. You free at the end.
The full code is
#include <stdio.h>
#include <stdlib.h>
#include <getopt.h>
#include <cuda.h>
#include <sys/time.h>
#include <math.h>

#define CHK_CUDA(e) {if (e != cudaSuccess) {fprintf(stderr,"Error: %s\n", cudaGetErrorString(e)); exit(-1);}}

__global__ void cuda_saxpb(const float *xd, const float a, const float b,
                           float *yd, const int n) {
    int i = threadIdx.x + blockIdx.x*blockDim.x;
    if (i<n) {
        yd[i] = a*xd[i]+b;
    }
    return;
}

void cpu_saxpb(const float *x, float a, float b, float *y, int n) {
    int i;
    for (i=0;i<n;i++) {
        y[i] = a*x[i]+b;
    }
    return;
}

int get_options(int argc, char **argv, int *n, int *s, int *nb, float *a, float *b);
void tick(struct timeval *timer);
double tock(struct timeval *timer);

int main(int argc, char **argv) {
    int n=1000;
    int nblocks=10;
    int batchsize=100;

    float a = 5.;
    float b = -1.;

    int err;
    float *x, *y, *ycuda;
    float *xd, *yd;
    double abserr;
    int blocksize;
    int i;
    struct timeval cputimer;
    struct timeval gputimer;
    double cputime, gputime;

    err = get_options(argc, argv, &n, &batchsize, &nblocks, &a, &b);
    if (batchsize > n) {
        fprintf(stderr, "Resetting batchsize to size of vector, %d\n", n);
        batchsize = n;
    }
    if (err) return 0;

    x = (float *)malloc(n*sizeof(float));
    if (!x) return 1;

    y = (float *)malloc(n*sizeof(float));
    if (!y) {free(x); return 1;}

    ycuda = (float *)malloc(n*sizeof(float));
    if (!ycuda) {free(y); free(x); return 1;}

    /* run CPU code */
    tick(&cputimer);
    cpu_saxpb(x, a, b, y, n);
    cputime = tock(&cputimer);

    /* run GPU code */
    /* only have to allocate the device buffers once */
    CHK_CUDA( cudaMalloc(&xd, batchsize*sizeof(float)) );
    CHK_CUDA( cudaMalloc(&yd, batchsize*sizeof(float)) );

    tick(&gputimer);
    int nbatches = 0;
    for (int nstart=0; nstart < n; nstart+=batchsize) {
        int size = batchsize;
        if ((nstart + batchsize) > n) size = n - nstart;

        CHK_CUDA( cudaMemcpy(xd, &(x[nstart]), size*sizeof(float), cudaMemcpyHostToDevice) );
        blocksize = (size+nblocks-1)/nblocks;
        cuda_saxpb<<<nblocks, blocksize>>>(xd, a, b, yd, size);
        CHK_CUDA( cudaMemcpy(&(ycuda[nstart]), yd, size*sizeof(float), cudaMemcpyDeviceToHost) );

        nbatches++;
    }
    gputime = tock(&gputimer);

    CHK_CUDA( cudaFree(xd) );
    CHK_CUDA( cudaFree(yd) );

    abserr = 0.;
    for (i=0;i<n;i++) {
        abserr += fabs(ycuda[i] - y[i]);
    }

    printf("Y = a*X + b, problemsize = %d\n", n);
    printf("CPU time = %lg millisec.\n", cputime*1000.);
    printf("GPU time = %lg millisec (done with %d batches of %d).\n",
           gputime*1000., nbatches, batchsize);
    printf("CUDA and CPU results differ by %lf\n", abserr);

    free(x);
    free(y);
    free(ycuda);
    return 0;
}

int get_options(int argc, char **argv, int *n, int *s, int *nb, float *a, float *b) {
    const struct option long_options[] = {
        {"nvals"     , required_argument, 0, 'n'},
        {"nblocks"   , required_argument, 0, 'B'},
        {"batchsize" , required_argument, 0, 's'},
        {"a", required_argument, 0, 'a'},
        {"b", required_argument, 0, 'b'},
        {"help", no_argument, 0, 'h'},
        {0, 0, 0, 0}};

    int c;   /* int, not char: getopt_long() returns an int, and -1 marks the end */
    int option_index;
    int tempint;

    while (1) {
        c = getopt_long(argc, argv, "n:B:a:b:s:h", long_options, &option_index);
        if (c == -1) break;

        switch(c) {
            case 'n': tempint = atoi(optarg);
                      if (tempint < 1 || tempint > 500000) {
                          fprintf(stderr,"%s: Cannot use number of points %s;\n Using %d\n", argv[0], optarg, *n);
                      } else {
                          *n = tempint;
                      }
                      break;

            case 's': tempint = atoi(optarg);
                      if (tempint < 1 || tempint > 50000) {
                          fprintf(stderr,"%s: Cannot use number of points %s;\n Using %d\n", argv[0], optarg, *s);
                      } else {
                          *s = tempint;
                      }
                      break;

            case 'B': tempint = atoi(optarg);
                      if (tempint < 1 || tempint > 1000 || tempint > *n) {
                          fprintf(stderr,"%s: Cannot use number of blocks %s;\n Using %d\n", argv[0], optarg, *nb);
                      } else {
                          *nb = tempint;
                      }
                      break;

            case 'a': *a = atof(optarg);
                      break;

            case 'b': *b = atof(optarg);
                      break;

            case 'h':
                puts("Calculates y[i] = a*x[i] + b on the GPU.");
                puts("Options: ");
                puts("    --nvals=N     (-n N): Set the number of values in y,x.");
                puts("    --batchsize=N (-s N): Set the number of values to transfer at a time.");
                puts("    --nblocks=N   (-B N): Set the number of blocks used.");
                puts("    --a=X         (-a X): Set the parameter a.");
                puts("    --b=X         (-b X): Set the parameter b.");
                puts("    --niters=N    (-I X): Set number of iterations to calculate.");
                puts("");
                return +1;
        }
    }
    return 0;
}

void tick(struct timeval *timer) {
    gettimeofday(timer, NULL);
}

double tock(struct timeval *timer) {
    struct timeval now;
    gettimeofday(&now, NULL);
    return (now.tv_usec-timer->tv_usec)/1.0e6 + (now.tv_sec - timer->tv_sec);
}
Running this one gets:
$ ./batched-saxpb --nvals=10240 --batchsize=10240 --nblocks=20
Y = a*X + b, problemsize = 10240
CPU time = 0.072 millisec.
GPU time = 0.117 millisec (done with 1 batches of 10240).
CUDA and CPU results differ by 0.000000
$ ./batched-saxpb --nvals=10240 --batchsize=5120 --nblocks=20
Y = a*X + b, problemsize = 10240
CPU time = 0.066 millisec.
GPU time = 0.133 millisec (done with 2 batches of 5120).
CUDA and CPU results differ by 0.000000
$ ./batched-saxpb --nvals=10240 --batchsize=2560 --nblocks=20
Y = a*X + b, problemsize = 10240
CPU time = 0.067 millisec.
GPU time = 0.167 millisec (done with 4 batches of 2560).
CUDA and CPU results differ by 0.000000
The GPU time goes up in this case (we're doing more memory copies) but the answers stay the same.
Edited: The original version of this code had an option for running multiple iterations of the kernel for timing purposes, but that's unnecessarily confusing in this context so it's removed.
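For completeness, here is a sketch of the double-buffering variant mentioned at the top of this answer (my outline, not part of the original program; it reuses cuda_saxpb and CHK_CUDA from above, and assumes x and ycuda were allocated with cudaMallocHost() so the asynchronous copies can actually overlap):

float *xd2[2], *yd2[2];
cudaStream_t stream[2];
for (int s = 0; s < 2; s++) {
    CHK_CUDA( cudaMalloc(&xd2[s], batchsize*sizeof(float)) );
    CHK_CUDA( cudaMalloc(&yd2[s], batchsize*sizeof(float)) );
    CHK_CUDA( cudaStreamCreate(&stream[s]) );
}

int batch = 0;
for (int nstart = 0; nstart < n; nstart += batchsize, batch++) {
    int s = batch % 2;                       /* alternate between the two buffers */
    int size = batchsize;
    if ((nstart + batchsize) > n) size = n - nstart;

    CHK_CUDA( cudaMemcpyAsync(xd2[s], &(x[nstart]), size*sizeof(float),
                              cudaMemcpyHostToDevice, stream[s]) );
    int bs = (size + nblocks - 1)/nblocks;
    cuda_saxpb<<<nblocks, bs, 0, stream[s]>>>(xd2[s], a, b, yd2[s], size);
    CHK_CUDA( cudaMemcpyAsync(&(ycuda[nstart]), yd2[s], size*sizeof(float),
                              cudaMemcpyDeviceToHost, stream[s]) );
}
CHK_CUDA( cudaDeviceSynchronize() );

for (int s = 0; s < 2; s++) {
    CHK_CUDA( cudaFree(xd2[s]) );
    CHK_CUDA( cudaFree(yd2[s]) );
    CHK_CUDA( cudaStreamDestroy(stream[s]) );
}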

How to fix invalid argument errors during creation of MPI derived datatypes

I have one structure xyz as given below:

struct xyz {
    char a;
    int32_t b;
    char c[50];
    uint32_t d;
    uchar e[10];
};

I need to broadcast it, so I used MPI_Bcast(), for which I need an MPI datatype corresponding to struct xyz. To create one I used MPI_Type_create_struct(), declaring

MPI_Datatype MPI_my_new_datatype, oldtypes[4];

where I used the MPI datatypes corresponding to the structure members:

oldtypes[4] = {MPI_CHAR, MPI_INT, MPI_UNSIGNED, MPI_UNSIGNED_CHAR};

and then created and committed the new datatype:

MPI_Type_create_struct(4, blockcounts, offsets, oldtypes, &MPI_my_new_datatype);
MPI_Type_commit(&MPI_my_new_datatype);
It now compiles, but gives the following runtime error:

An error occurred in MPI_Type_create_struct on communicator MPI_COMM_WORLD
MPI_ERR_ARG: invalid argument of some other kind
MPI_ERRORS_ARE_FATAL (goodbye).

Can anyone tell me where the problem is?
You can't "bundle up" the similar types like that. Each field needs to be addressed separately, and there are 5 of them, not 4.
Also note that, in general, it's a good idea to actually measure the offsets rather than infer them.
The following works:
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <mpi.h>

struct xyz_t {
    char a;
    int32_t b;
    char c[50];
    uint32_t d;
    unsigned char e[10];
};

int main(int argc, char **argv) {
    int rank, size;
    MPI_Datatype oldtypes[5] = {MPI_CHAR, MPI_INT, MPI_CHAR, MPI_UNSIGNED, MPI_UNSIGNED_CHAR};
    int blockcounts[5] = {1, 1, 50, 1, 10};
    MPI_Datatype my_mpi_struct;
    MPI_Aint offsets[5];
    struct xyz_t old, new;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* find offsets */
    offsets[0] = (char*)&(old.a) - (char*)&old;
    offsets[1] = (char*)&(old.b) - (char*)&old;
    offsets[2] = (char*)&(old.c) - (char*)&old;
    offsets[3] = (char*)&(old.d) - (char*)&old;
    offsets[4] = (char*)&(old.e) - (char*)&old;

    MPI_Type_create_struct(5, blockcounts, offsets, oldtypes, &my_mpi_struct);
    MPI_Type_commit(&my_mpi_struct);

    if (rank == 0) {
        old.a = 'a';
        old.b = (int)'b';
        strcpy(old.c, "This is field c");
        old.d = (unsigned int)'d';
        strcpy((char *)old.e, "Field e");   /* e is unsigned char[], so cast for strcpy */

        MPI_Send(&old, 1, my_mpi_struct, 1, 1, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Status status;
        MPI_Recv(&new, 1, my_mpi_struct, 0, 1, MPI_COMM_WORLD, &status);

        printf("new.a = %c\n", new.a);
        printf("new.b = %d\n", new.b);
        printf("new.e = %s\n", new.e);
    }

    MPI_Type_free(&my_mpi_struct);
    MPI_Finalize();
    return 0;
}
Running:
$ mpirun -np 2 ./struct
new.a = a
new.b = 98
new.e = Field e
Updated: as Dave Goodell points out below, the offset calculations would be better done as
#include <stddef.h>
/* ... */
offsets[0] = offsetof(struct xyz_t,a);
offsets[1] = offsetof(struct xyz_t,b);
offsets[2] = offsetof(struct xyz_t,c);
offsets[3] = offsetof(struct xyz_t,d);
offsets[4] = offsetof(struct xyz_t,e);
and, if your MPI supports it (most should, though Open MPI was slow with some of the MPI 2.2 types), the MPI_UNSIGNED should be replaced with MPI_UINT32_T.
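That is, a one-line change (MPI_UINT32_T is the MPI 2.2 spelling of the fixed-width unsigned type):

oldtypes[3] = MPI_UINT32_T;   /* exact match for the uint32_t field d */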
