Using MPI_Send/Recv to handle chunk of multi-dim array in Fortran 90

I have to send and receive (via MPI) a chunk of a multi-dimensional array in Fortran 90. The line
MPI_Send(x(2:5,6:8,1),12,MPI_Real,....)
is not supposed to be used, as per the book "Using MPI..." by Gropp, Lusk, and Skjellum. What is the best way to do this? Do I have to create a temporary array and send it, or use MPI_Type_Create_Subarray, or something like that?

The reason not to use array sections with MPI_SEND is that with some MPI implementations the compiler has to create a temporary copy. This is because Fortran can only properly pass array sections to subroutines with explicit interfaces and has to generate temporary "flattened" copies in all other cases, usually on the stack of the calling subroutine. Unfortunately, before the TR 29113 extension to Fortran 2008 there was no way to declare subroutines that take arguments of varying type, so MPI implementations usually resort to language hacks: e.g. MPI_Send is implemented entirely in C and relies on Fortran always passing the data as a pointer.
Some MPI libraries work around this issue by generating a huge number of overloads for MPI_SEND:
one that takes a single INTEGER
one that takes a 1-d array of INTEGER
one that takes a 2-d array of INTEGER
and so on
The same is then repeated for CHARACTER, LOGICAL, DOUBLE PRECISION, etc. This is still a hack, as it does not cover cases where one passes a user-defined type. Further, it greatly complicates the C implementation, which now has to understand the Fortran array descriptors, which are very compiler-specific.
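For illustration, such a work-around might look roughly like the following sketch of a generic interface. The specific procedure names and argument lists here are made up; real implementations generate a very large set of such specifics automatically, one per type/rank combination of the buffer.

INTERFACE MPI_Send
   SUBROUTINE MPI_Send_integer_scalar(buf, count, datatype, dest, tag, comm, ierror)
      INTEGER, INTENT(IN)  :: buf                             ! scalar INTEGER buffer
      INTEGER, INTENT(IN)  :: count, datatype, dest, tag, comm
      INTEGER, INTENT(OUT) :: ierror
   END SUBROUTINE MPI_Send_integer_scalar
   SUBROUTINE MPI_Send_integer_1d(buf, count, datatype, dest, tag, comm, ierror)
      INTEGER, INTENT(IN)  :: buf(*)                          ! 1-d INTEGER buffer
      INTEGER, INTENT(IN)  :: count, datatype, dest, tag, comm
      INTEGER, INTENT(OUT) :: ierror
   END SUBROUTINE MPI_Send_integer_1d
   ! ... and so on for 2-d, 3-d, ... buffers and for every other type
END INTERFACE MPI_Send

Generic resolution can tell the specifics apart by the type and rank of buf, which is exactly why one specific per type/rank combination is needed.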
Fortunately times are changing. The TR 29113 extension to Fortran 2008 includes two new features:
assumed-type arguments: TYPE(*)
assumed-rank arguments: DIMENSION(..)
The combination of both, i.e. TYPE(*), DIMENSION(..), INTENT(IN) :: buf, describes an argument that can both be of varying type and have any rank. This is already being taken advantage of in the new mpi_f08 interface of MPI-3.
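As a sketch of what this buys you, an explicit interface using both features might look like this. The routine generic_send is hypothetical (it is not part of any MPI binding), and it requires a compiler with TR 29113 / Fortran 2018 support:

! A single interface body now covers buffers of any type and any rank.
INTERFACE
   SUBROUTINE generic_send(buf, count, datatype, dest, tag, comm, ierror)
      TYPE(*), DIMENSION(..), INTENT(IN) :: buf   ! assumed type, assumed rank
      INTEGER, INTENT(IN)  :: count, datatype, dest, tag, comm
      INTEGER, INTENT(OUT) :: ierror
   END SUBROUTINE generic_send
END INTERFACE

Inside such a procedure the buffer can essentially only be handed on to C code (or to another assumed-rank dummy), which is exactly how the mpi_f08 module uses it.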
Non-blocking calls present bigger problems in Fortran that go beyond what Alexander Vogt has described. The reason is that the compiler is free to keep a variable's value in a register across a call, and older Fortran standards provide no portable way to suppress this optimisation for data that may change asynchronously (the VOLATILE attribute only appeared in Fortran 2003). The following code might not run as expected:
INTEGER :: data
data = 10
CALL MPI_IRECV(data, 1, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, req, ierr)
! data is not used here
! ...
CALL MPI_WAIT(req, MPI_STATUS_IGNORE, ierr)
! data is used here
One might expect that after the call to MPI_WAIT data would contain the value received from rank 0, but this might very well not be the case. The reason is that the compiler cannot know that data might change asynchronously after MPI_IRECV returns, and may therefore keep its value in a register instead of re-reading it from memory. That's why non-blocking MPI calls are generally considered dangerous in Fortran.
TR 29113 has a solution for this second problem too: the ASYNCHRONOUS attribute. If you take a look at the mpi_f08 definition of MPI_IRECV, its buf argument is declared as:
TYPE(*), DIMENSION(..), INTENT(OUT), ASYNCHRONOUS :: buf
Even if buf is a scalar argument, i.e. no temporary copy is created, a TR 29113 compliant compiler would not resort to register optimisations for the buffer argument.
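On the user side, the corresponding remedy is to give the buffer the ASYNCHRONOUS attribute yourself. A minimal sketch of the earlier fragment (same assumptions: USE mpi and a sender on rank 0), assuming a compiler that honours ASYNCHRONOUS with TR 29113 semantics for MPI buffers:

INTEGER, ASYNCHRONOUS :: data   ! may change outside the normal program flow
INTEGER :: req, ierr

data = 10
CALL MPI_IRECV(data, 1, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, req, ierr)
! ... work that does not touch data ...
CALL MPI_WAIT(req, MPI_STATUS_IGNORE, ierr)
! data now reliably holds the value received from rank 0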

EDIT: As Hristo Iliev pointed out, MPI_Send is always blocking, but the implementation might choose to send the data asynchronously. From here:
MPI_Send will not return until you can use the send buffer.
Non-blocking communication (like MPI_Isend) might pose a problem with Fortran when non-contiguous arrays are involved: the compiler creates a temporary array for the dummy variable and passes it to the subroutine. Once the subroutine has finished, the compiler is at liberty to free the memory of that copy.
That's fine as long as you use blocking communication (MPI_Send), because then the send buffer may safely be reused once the subroutine returns. For non-blocking communication (MPI_Isend), however, the temporary array is the send buffer, and the subroutine returns before the data has been sent.
So it might happen that MPI sends data from a memory location that no longer holds valid data.
So, either you create a copy yourself (so that your send buffer is contiguous in memory), or you create a sub-array datatype (i.e. tell MPI the addresses in memory of the elements you want to send). There are further alternatives out there, like MPI_Pack, but I have no experience with them.
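For the array section from the original question, x(2:5,6:8,1), the two options might look roughly like this. This is only a sketch: it assumes x is declared with lower bounds of 1, and dest, tag and the error handling are placeholders.

REAL, ALLOCATABLE :: tmp(:,:,:)
INTEGER :: subarr, ierr
INTEGER :: dest, tag                      ! assumed to be set elsewhere
INTEGER, DIMENSION(3) :: sizes, subsizes, starts

! Option 1: make an explicit contiguous copy and send that.  The copy must
! stay allocated until the blocking send returns (or, with MPI_ISEND, until
! the matching MPI_WAIT completes).
tmp = x(2:5, 6:8, 1:1)                    ! 4 x 3 x 1 = 12 elements
CALL MPI_SEND(tmp, SIZE(tmp), MPI_REAL, dest, tag, MPI_COMM_WORLD, ierr)
DEALLOCATE(tmp)

! Option 2: describe the chunk with a subarray datatype and send in place.
sizes    = SHAPE(x)                       ! shape of the full array
subsizes = (/ 4, 3, 1 /)                  ! shape of x(2:5,6:8,1)
starts   = (/ 1, 5, 0 /)                  ! zero-based offsets of the chunk
CALL MPI_TYPE_CREATE_SUBARRAY(3, sizes, subsizes, starts, MPI_ORDER_FORTRAN, &
                              MPI_REAL, subarr, ierr)
CALL MPI_TYPE_COMMIT(subarr, ierr)
CALL MPI_SEND(x, 1, subarr, dest, tag, MPI_COMM_WORLD, ierr)
CALL MPI_TYPE_FREE(subarr, ierr)

The subarray variant avoids the copy entirely, and the same committed datatype can be reused for the matching receive.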
Which way is faster? Well, that depends:
On the actual implementation of your MPI library
On the data and its distribution
On your compiler
On your hardware
See here for a detailed explanation and further options.

Related

How can I further understand the compilation process used by GCC?

I was trying to reverse engineer some PSP programs developed using the free
pspsdk
https://sourceforge.net/projects/minpspw/
I created a function to see how MIPS handles more than 4 arguments (a0-a4).
Everyone I know has told me that they get passed on the stack.
To my surprise, the 5th argument was actually passed in register t0, and the compiler didn't even use the stack!
It also inlined a function without even using a jal or a jump to it (an obvious optimization).
Although there was indeed space in memory for it, and you could double-check by using print with a function-pointer argument, the actual code that was executed was automatically inlined without the need for a function-call instruction.
But that doesn't really help a reverse-engineering attempt...
There is a man page for this version of gcc, and it takes seconds to install, if anyone is able to provide its man page for compilation, if there is one.
It's so long I don't even know how to reference information reliably.
How arguments are passed is specified by the ABI (application binary interface), so you have to find the respective documents.
Moreover, there is more than one such ABI, namely n32 and n64. In the case of mips-gcc, some of the decisions are commented on in the GCC sources, for example in ./gcc/config/mips/mips.h:
/* This structure has to cope with two different argument allocation
schemes. Most MIPS ABIs view the arguments as a structure, of which
the first N words go in registers and the rest go on the stack. If I
< N, the Ith word might go in Ith integer argument register or in a
floating-point register. For these ABIs, we only need to remember
the offset of the current argument into the structure.
The EABI instead allocates the integer and floating-point arguments
separately. The first N words of FP arguments go in FP registers,
the rest go on the stack. Likewise, the first N words of the other
arguments go in integer registers, and the rest go on the stack. We
need to maintain three counts: the number of integer registers used,
the number of floating-point registers used, and the number of words
passed on the stack.
We could keep separate information for the two ABIs (a word count for
the standard ABIs, and three separate counts for the EABI). But it
seems simpler to view the standard ABIs as forms of EABI that do not
allocate floating-point registers.
So for the standard ABIs, the first N words are allocated to integer
registers, and mips_function_arg decides on an argument-by-argument
basis whether that argument should really go in an integer register,
or in a floating-point one. */
There are more such comments in the mips backend. Search for "cumulative" or "CUMULATIVE" in mips.c and mips.h.

Halide: Filter elements out of vector (Halide::Runtime::Buffer)

I have a Halide::Runtime::Buffer and would like to remove elements that match a criterion, ideally such that the operation occurs in-place and the function can be defined in a Halide::Generator.
I have looked into using reductions, but it seems to me that I cannot output a vector of a different length -- I can only set certain elements to a value of my choice.
So far, the only way I got it to work was by using an extern "C" call and passing the Buffer I wanted to filter, along with a boolean Buffer (1's and 0's as ints). I read the Buffers into vectors of another library (Armadillo), performed my desired filter, then read the filtered vector back into Halide.
This seems quite messy. Also, with this code I'm passing a Halide::Buffer object, not a Halide::Runtime::Buffer object, so I don't know how to implement this within a Halide::Generator.
So my question is twofold:
Can this kind of filtering be achieved in pure Halide, preferably in-place?
Is there an example of using extern "C" functions within Generators?
The first part is effectively stream compaction. It can be done in Halide, though the output size will either need to be fixed or a function of the input size (e.g. the same size as the input). One can also get the maximum index produced as output, to indicate how many results were produced. I wrote up a bit of an answer on how to do a prefix-sum based stream compaction here: Halide: Reduction over a domain for the specific values. It is an open question how to do this most efficiently in parallel across a variety of targets, and we hope to do some work on exploring that space soon.
Whether this is in-place or not depends on whether one can put everything into a single series of update definitions for a Func. E.g. it cannot be done in-place on an input passed into a Halide filter, because reductions always allocate a buffer to work on. It may be possible if the input is produced inside the Generator.
Re: the second question, are you using define_extern? This is not super well integrated with Halide::Runtime::Buffer, as the external function must be implemented with halide_buffer_t, but it is fairly straightforward to access from within a Generator. We don't have a tutorial on this yet, but there are a number of examples in the tests. E.g.:
https://github.com/halide/Halide/blob/master/test/generator/define_extern_opencl_generator.cpp#L19
and the definition:
https://github.com/halide/Halide/blob/master/test/generator/define_extern_opencl_aottest.cpp#L119
(These do not need to be extern "C", as I implemented C++ name mangling a while back. Just set the name-mangling parameter of define_extern to NameMangling::CPlusPlus and remove the extern "C" from the external function's declaration. This is very useful, as it gets you link-time type checking on the external function, which catches a moderately frequent class of errors.)

Fortran unformatted I/O optimization

I'm working on a set of Fortran programs that are heavily I/O bound, and so I am trying to optimize this. I've read in multiple places that writing entire arrays is faster than writing individual elements, i.e. WRITE(10) arr is faster than DO i=1,n; WRITE(10) arr(i); ENDDO. But I'm unclear where my case falls in this regard. Conceptually, my code is something like:
OPEN(10,FILE='testfile',FORM='UNFORMATTED')
DO i=1,n
   ! calculations to determine m values stored in array arr
   WRITE(10) m
   DO j=1,m
      WRITE(10) arr(j)
   ENDDO
ENDDO
But m may change each time through the DO i=1,n loop, such that writing the whole array arr isn't an option. So, collapsing the inner DO loop for writing would end up with WRITE(10) arr(1:m), which isn't the same as writing the whole array. Would this still provide a speed-up for writing? What about reading? I could allocate an array of size m after the calculations, assign the values to that array, write it, then deallocate it, but that seems too involved.
I've also seen differing information on implied DO loop writes, i.e. WRITE(10) (arr(j),j=1,m), as to whether they help or hurt I/O overhead.
I'm running a couple of tests now, and intend to update with my observations. Other suggestions are welcome.
Additional details:
The first program creates a large file, the second reads it. And, no, merging the two programs and keeping everything in memory isn't a valid option.
I'm using unformatted I/O and have access to the Portland Group and gfortran compilers. It's my understanding that the Portland Group's is generally faster, so that's what I'm using.
The output file is currently ~600 GB, the codes take several hours to run.
The second program (reading in the file) seems especially costly. I've monitored the system and seen that it's mostly CPU-bound, even when I reduce the code to little more than reading the file, indicating that there is very significant CPU overhead on all the I/O calls when each value is read in one-at-a-time.
Compiler flags: -O3 (high optimization), -fastsse (various performance enhancements, optimized for SSE hardware), -Mipa=fast,inline (enables aggressive inter-procedural analysis/optimization in the compiler).
UPDATE
I ran the codes with WRITE(10) arr(1:m) and READ(10) arr(1:m). My tests with these agreed, and showed a reduction in runtime of about 30% for the WRITE code; the output file is also slightly less than half the original's size. For the second code, reading in the file, I made the code do basically nothing but read the file, to compare pure read time. This reduced the run time by a factor of 30.
If you use normal unformatted (record-oriented) I/O, you also write a record marker before and after the data itself. So you add eight bytes (usually) of overhead to each data item, which can easily (almost) double the data written to disc if you write one double-precision number per record (4 + 8 + 4 = 16 bytes on disk for an 8-byte value). The runtime overhead mentioned in the other answers is also significant.
The argument above does not apply if you use unformatted stream.
So, use
WRITE (10) m
WRITE (10) arr(1:m)
For gfortran, this is faster than an implied DO loop (i.e. the solution WRITE (10) (arr(i),i=1,m)).
In the suggested solution, an array descriptor is built and passed to the library with a single call. I/O can then be done much more efficiently, in your case taking advantage of the fact that the data is contiguous.
For the implied DO loop, gfortran issues multiple library calls, with much more overhead. This could be optimized, and is the subject of a long-standing bug report, PR 35339, but some complicated corner cases and the presence of a viable alternative have kept this from being done.
I would also suggest doing the I/O in stream access, not because of the rather insignificant saving in space (see above), but because keeping the leading record marker up to date when writing requires a seek, which is additional effort.
If your data size is very large, above ~2^31 bytes per record, you might run into different behavior with record markers. gfortran uses subrecords in this case (compatible with Intel Fortran), but it should just work. I don't know what the Portland Group compiler does in this case.
For reading, of course, you can read m, then allocate an allocatable array, then read the whole array in one READ statement.
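Putting the pieces together, here is a sketch of both sides using unformatted stream access. Variable names follow the question; n, arr and the per-iteration calculations are assumed to exist, and a file written this way must also be read with ACCESS='STREAM', since it contains no record markers.

INTEGER :: i, m, ios
REAL, ALLOCATABLE :: buf(:)

! Writer: one stream-access file, one WRITE per block of data
OPEN(10, FILE='testfile', FORM='UNFORMATTED', ACCESS='STREAM', STATUS='REPLACE')
DO i = 1, n
   ! ... calculations that set m and fill arr(1:m) ...
   WRITE(10) m
   WRITE(10) arr(1:m)
ENDDO
CLOSE(10)

! Reader: read m, allocate, then read the whole block in one statement
OPEN(10, FILE='testfile', FORM='UNFORMATTED', ACCESS='STREAM', STATUS='OLD')
DO
   READ(10, IOSTAT=ios) m
   IF (ios /= 0) EXIT                     ! end of file
   IF (ALLOCATED(buf)) DEALLOCATE(buf)
   ALLOCATE(buf(m))
   READ(10) buf
   ! ... process buf(1:m) ...
ENDDO
CLOSE(10)

The same two-WRITE pattern also works with the default sequential access; stream access just drops the record markers and the associated seeks.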
The point of avoiding a loop over multiple WRITE() operations is to avoid the multiple WRITE() operations themselves; it's not particularly important whether the data being output are all the members of the array.
Writing either an array section or a whole array via a single WRITE() operation is a good bet. An implied DO loop cannot be worse than an explicit outer loop, but whether it's any better is a question of compiler implementation. (Though I'd expect the implied-DO to be better than an outer loop.)

When sending data using MPI in Fortran, does it have to be an array?

This is a pretty simple question. I'm practicing using MPI in Fortran, and as far as I can tell, the send and receive buffers must be arrays, like this:
do 20 i = 1, 25
   final_sum(1) = final_sum(1) + random_numbers(i)
20 continue
print *, 'process final sum: ', final_sum(1)
call MPI_GATHER(final_sum, 1, MPI_INTEGER, sum_recv_buff, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierror)
Here, I'm using the array final_sum to hold a single value. Is there a better way to do this? This is my first time using Fortran, and since I've already done some practice in C, I was trying out Fortran to see the differences and similarities.
High Performance Mark is right. There is normally no type checking when calling MPI routines that work with buffers. The MPI routines make use of the Fortran feature that allows calling procedures through an implicit interface. At the machine-language level, just a pointer is passed (it may be a pointer to a temporary copy!). That means you can use scalars or arrays without any problem. Just use count 1 and the correct MPI type and you can send and receive scalars.
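For instance, a minimal sketch gathering one scalar per rank (using the legacy mpi module, whose buffer arguments are effectively untyped; the values themselves are just placeholders):

PROGRAM scalar_gather
   USE mpi
   IMPLICIT NONE
   INTEGER :: final_sum, rank, nprocs, ierror
   INTEGER, ALLOCATABLE :: sum_recv_buff(:)

   CALL MPI_INIT(ierror)
   CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
   CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierror)
   ALLOCATE(sum_recv_buff(nprocs))

   final_sum = rank + 1          ! plain scalar used as the send buffer

   ! count 1 + MPI_INTEGER: a scalar works just as well as a 1-element array
   CALL MPI_GATHER(final_sum, 1, MPI_INTEGER, sum_recv_buff, 1, MPI_INTEGER, &
                   0, MPI_COMM_WORLD, ierror)

   IF (rank == 0) PRINT *, 'gathered sums: ', sum_recv_buff

   CALL MPI_FINALIZE(ierror)
END PROGRAM scalar_gather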
MPI derived datatypes are a feature that will enable you to work with even more complicated pieces of data.

Mapping Untyped Lisp data into a typed binary format for use in compiled functions

Background: I'm writing a toy Lisp (Scheme) interpreter in Haskell. I'm at the point where I would like to be able to compile code using LLVM. I've spent a couple days dreaming up various ways of feeding untyped Lisp values into compiled functions that expect to know the format of the data coming at them. It occurs to me that I am not the first person to need to solve this problem.
Question: What are some historically successful ways of mapping untyped data into an efficient binary format?
Addendum: In point of fact, I do know which of about a dozen different types the data is, I just don't know which one might be sent to the function at compile time. The function itself needs a way to determine what it got.
Do you mean, "I just don't know which [type] might be sent to the function at runtime"? It's not that the data isn't typed; certainly 1 and '() have different types. Rather, the data is not statically typed, i.e., it's not known at compile time what the type of a given variable will be. This is called dynamic typing.
You're right that you're not the first person to need to solve this problem. The canonical solution is to tag each runtime value with its type. For example, if you have a dozen types, number them like so:
0 = integer
1 = cons pair
2 = vector
etc.
Once you've done this, reserve the first four bits of each word for the tag. Then, every time two objects get passed in to +, first you perform a simple bit mask to verify that both objects' first four bits are 0b0000, i.e., that they are both integers. If they are not, you jump to an error message; otherwise, you proceed with the addition, and make sure that the result is also tagged accordingly.
This technique essentially makes each runtime value a manually-tagged union, which should be familiar to you if you've used C. In fact, it's also just like a Haskell data type, except that in Haskell the taggedness is much more abstract.
I'm guessing that you're familiar with pointers if you're trying to write a Scheme compiler. To avoid limiting your usable memory space, it may make more sense to use the bottom (least significant) four bits rather than the top ones. Better yet, because aligned dword pointers already have three meaningless bits at the bottom, you can simply co-opt those bits for your tag, as long as you dereference the actual address rather than the tagged one.
Does that help?
Your default solution should be a simple tagged union. If you want to narrow your typing down to more specific types, you can do it - but it won't be that "toy" any more. A thing to look at is called abstract interpretation.
There are a few successful implementations of such an optimisation, with V8 probably being the most widespread. In the Scheme world, the most aggressively optimising implementation is Stalin.
