Halide: Filter elements out of vector (Halide::Runtime::Buffer)

I have a Halide::Runtime::Buffer and would like to remove elements that match a criteria, ideally such that the operation occurs in-place and that the function can be defined in a Halide::Generator.
I have looked into using reductions, but it seems to me that I cannot output a vector of a different length -- I can only set certain elements to a value of my choice.
So far, the only way I got it to work was by using an extern "C" call and passing the Buffer I wanted to filter, along with a boolean Buffer (1's and 0's as ints). I read the Buffers into vectors of another library (Armadillo), performed my desired filtering, then read the filtered vector back into Halide.
This seems quite messy. Also, with this code I'm passing a Halide::Buffer object rather than a Halide::Runtime::Buffer object, so I don't know how to implement this within a Halide::Generator.
So my question is twofold:
Can this kind of filtering be achieved in pure Halide, preferably in-place?
Is there an example of using extern "C" functions within Generators?

The first part is effectively stream compaction. It can be done in Halide, though the output size will either need to be fixed or be a function of the input size (e.g. the same size as the input). One can also get the maximum index produced as an output, to indicate how many results were produced. I wrote up a bit of an answer on how to do a prefix-sum based stream compaction here: Halide: Reduction over a domain for the specific values. It is an open question how to do this most efficiently in parallel across a variety of targets, and we hope to do some work on exploring that space soon.
Whether this happens in place or not depends on whether one can put everything into a single series of update definitions for a Func. E.g. it cannot be done in place on an input passed into a Halide filter, because reductions always allocate a buffer to work on. It may be possible if the input is produced inside the Generator.
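For concreteness, here is a minimal sketch of the prefix-sum approach (my own illustration, not the code from the linked answer; it assumes an Int(32) input of known size N and a "keep the non-zero elements" criterion):

```cpp
#include "Halide.h"
using namespace Halide;

// Compact the non-zero elements of in(x), x in [0, N), to the front of the output.
Func compact_nonzero(Func in, Expr N) {
    Var x;

    // 1 where the element is kept, 0 where it is dropped.
    Func keep;
    keep(x) = select(in(x) != 0, 1, 0);

    // Exclusive prefix sum of the keep flags: prefix(i) is the output slot
    // of element i if it survives the filter.
    Func prefix;
    prefix(x) = 0;
    RDom r(1, N - 1);
    prefix(r) = prefix(r - 1) + keep(r - 1);

    // Scatter the surviving elements into their slots; the tail keeps the
    // pure value. prefix(N - 1) + keep(N - 1) gives the number of results.
    Func out;
    out(x) = 0;
    RDom s(0, N);
    s.where(keep(s) == 1);
    out(clamp(prefix(s), 0, N - 1)) = in(s);
    return out;
}
```

Scheduling this, and in particular doing the scan efficiently in parallel, is the open question mentioned above.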
Re: the second question, are you using define_extern? This is not super well integrated with Halide::Runtime::Buffer, as the external function must be implemented in terms of halide_buffer_t, but it is fairly straightforward to access from within a Generator. We don't have a tutorial on this yet, but there are a number of examples in the tests. E.g.:
https://github.com/halide/Halide/blob/master/test/generator/define_extern_opencl_generator.cpp#L19
and the definition:
https://github.com/halide/Halide/blob/master/test/generator/define_extern_opencl_aottest.cpp#L119
(These do not need to be extern "C", as I implemented C++ name mangling a while back. Just set the name mangling parameter of define_extern to NameMangling::CPlusPlus and remove the extern "C" from the external function's declaration. This is very useful, as it gets you link-time type checking on the external function, which catches a moderately frequent class of errors.)
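To make this concrete, here is a rough sketch of what the Generator side can look like (my own illustration, not the linked test; the function name my_filter and the 1-D float pipeline are made up, and the exact define_extern overload may vary between Halide versions):

```cpp
#include "Halide.h"
using namespace Halide;

class FilterGenerator : public Generator<FilterGenerator> {
public:
    Input<Buffer<float>>  input{"input", 1};
    Output<Buffer<float>> output{"output", 1};

    void generate() {
        Var x("x");

        // Wrap the generator input as a plain Func so it can be passed as an
        // ExternFuncArgument.
        Func wrapped("wrapped");
        wrapped(x) = input(x);

        // The external function "my_filter" receives halide_buffer_t pointers
        // at runtime; NameMangling::CPlusPlus lets it be a plain C++ function
        // (no extern "C") and gives link-time type checking.
        Func filtered("filtered");
        filtered.define_extern("my_filter", {wrapped}, Float(32), 1,
                               NameMangling::CPlusPlus);

        output(x) = filtered(x);

        // Inputs to and outputs of an extern stage must be realized into buffers.
        wrapped.compute_root();
        filtered.compute_root();
    }
};

HALIDE_REGISTER_GENERATOR(FilterGenerator, filter_generator)
```

The matching definition would then be an ordinary C++ function taking halide_buffer_t pointers and returning an int error code, linked into the final pipeline.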

Related

Mutable data types that use stack allocation

Based on my earlier question, I understand the benefit of using stack allocation. Suppose I have an array of arrays. For example, A is a list of matrices and each element A[i] is a 1x3 matrix. The length of A and the dimension of A[i] are known at run time (given by the user). Each A[i] is a matrix of Float64, and this is also known at run time. However, throughout the program, I will be modifying the values of A[i] element by element. What data structure would also allow me to use stack allocation? I tried StaticArrays but it doesn't allow me to modify a static array.
StaticArrays defines MArray (MVector, MMatrix) types that are fixed-size and mutable. If you use these there's a higher chance of the compiler determining that they can be stack-allocated, but it's not guaranteed. Moreover, since the pattern you're using is that you're passing the mutable state vector into a function which presumably modifies it, it's not going to be valid or helpful to stack allocate that anyway. If you're going to allocate state once and modify it throughout the program, it doesn't really matter if it is heap or stack allocated—stack allocation is only a big win for objects that are allocated, used locally and then don't escape the local scope, so they can be “freed” simply by popping the stack.
From the code snippet you showed in the linked question, the state vector is allocated in the outer function, test_for_loop, which shouldn't be a big deal since it's done once at the beginning of execution. Using a variably sized state vector to index into an array with a splat (...) might be an issue, however, and that's done in test_function. Using something with a fixed size like MVector might be better for that. It might, however, be better still to use a state tuple and return a new rather than mutated state tuple at the end. The compiler is very good at turning that kind of thing into very efficient code because of immutability.
Note that by convention test_function should be called test_function! since it modifies its M argument and even more so if it modifies the state vector.
I would also note that this isn't a great question/answer pair since it's not standalone at all and really just a continuation of your other question. StackOverflow isn't very good for this kind of iterative question/discussion interaction, I'm afraid.

Using MPI_Send/Recv to handle chunk of multi-dim array in Fortran 90

I have to send and receive (MPI) a chunk of a multi-dimensional array in FORTRAN 90. The line
MPI_Send(x(2:5,6:8,1),12,MPI_Real,....)
is not supposed to be used, as per the book "Using MPI..." by Gropp, Lusk, and Skjellum. What is the best way to do this? Do I have to create a temporary array and send it or use MPI_Type_Create_Subarray or something like that?
The reason not to use array sections with MPI_SEND is that with some MPI implementations the compiler has to create a temporary copy. This is due to the fact that Fortran can only properly pass array sections to subroutines with explicit interfaces, and has to generate temporary "flattened" copies in all other cases, usually on the stack of the calling subroutine. Unfortunately, before the TR 29113 extension to F2008, Fortran had no way to declare subroutines that take arguments of variable type, and MPI implementations usually resort to language hacks: e.g. MPI_Send is entirely implemented in C and relies on Fortran always passing the data as a pointer.
Some MPI libraries work around this issue by generating a huge number of overloads for MPI_SEND:
one that takes a single INTEGER
one that takes a 1-d array of INTEGER
one that takes a 2-d array of INTEGER
and so on
The same is then repeated for CHARACTER, LOGICAL, DOUBLE PRECISION, etc. This is still a hack, as it does not cover cases where one passes a user-defined type. Further, it greatly complicates the C implementation, as it now has to understand the Fortran array descriptors, which are very compiler-specific.
Fortunately times are changing. The TR 29113 extension to Fortran 2008 includes two new features:
assumed-type arguments: TYPE(*)
assumed-dimension arguments: DIMENSION(..)
The combination of both, i.e. TYPE(*), DIMENSION(..), INTENT(IN) :: buf, describes an argument that can both be of varying type and have any dimension. This is already being taken advantage of in the new mpi_f08 interface in MPI-3.
Non-blocking calls present bigger problems in Fortran that go beyond what Alexander Vogt has described. The reason is that Fortran does not have the concept of suppressing compiler optimisations (i.e. there is no volatile keyword in Fortran). The following code might not run as expected:
INTEGER :: data
data = 10
CALL MPI_IRECV(data, 1, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, req, ierr)
! data is not used here
! ...
CALL MPI_WAIT(req, MPI_STATUS_IGNORE, ierr)
! data is used here
One might expect that after the call to MPI_WAIT data would contain the value received from rank 0, but this might very well not be the case. The reason is that the compiler cannot know that data might change asynchronously after MPI_IRECV returns, and might therefore keep its value in a register instead. That's why non-blocking MPI calls are generally considered dangerous in Fortran.
TR 29113 has a solution for that second problem too: the ASYNCHRONOUS attribute. If you take a look at the mpi_f08 definition of MPI_IRECV, its buf argument is declared as:
TYPE(*), DIMENSION(..), INTENT(OUT), ASYNCHRONOUS :: buf
Even if buf is a scalar argument, i.e. no temporary copy is created, a TR 29113 compliant compiler would not resort to register optimisations for the buffer argument.
EDIT: As Hristo Iliev pointed out, MPI_Send is always blocking, but the implementation might choose to send the data asynchronously. From here:
MPI_Send will not return until you can use the send buffer.
Non-blocking communications (like MPI_Isend) might pose a problem with Fortran when non-contiguous arrays are involved. In that case, the compiler creates a temporary array for the dummy variable and passes it to the subroutine. Once the subroutine is finished, the compiler is at liberty to free the memory of that copy.
That's fine as long as you use blocking communication (MPI_Send), because then the message has been sent when the subroutine returns. For the non-blocking communication (MPI_Isend), however, the temporary array is the send buffer, and the subroutine returns before it has been sent.
So it might happen that MPI sends data from a memory location that no longer holds valid data.
So, either you create a copy yourself (so that your send buffer is contiguous in memory), or you create a sub-array (i.e. tell MPI the addresses in memory of elements you want to send). There are further alternatives out there, like MPI_Pack, but I have no experience with them.
Which way is faster? Well, that depends:
On the actual implementation of your MPI library
On the data and its distribution
On your compiler
On your hardware
See here for a detailed explanation and further options.

vector --> concurrent_vector migration + OpenGL restriction

I need to speed up some calculations, and the result of the calculation is then used to draw an OpenGL model.
A major speed-up was achieved when I changed std::vector to Concurrency::concurrent_vector and used parallel_for instead of plain for loops.
This vector (or concurrent_vector) is computed in a for (or parallel_for) loop and contains the vertices for OpenGL to visualize.
Using std::vector was fine because the OpenGL rendering procedure relies on the fact that std::vector keeps its items contiguous in memory, which is not the case with concurrent_vector. The code runs something like this:
glVertexPointer(3, GL_FLOAT, 0, &vectorWithVerticesData[0]);
Generating a concurrent_vector and then copying it into a std::vector is too expensive, since there are a lot of items.
So, the question is: I'd like to use OpenGL vertex arrays, but I'd also like to use concurrent_vector, which is incompatible with that kind of OpenGL output.
Any suggestions?
You're trying to use a data structure that doesn't store its elements contiguously in an API that requires contiguous storage. Well, one of those has to give, and it's not going to be OpenGL. GL isn't going to walk concurrent_vector's data structure (not if you like performance).
So your only real option is to not use a non-contiguous container.
I can only guess at what you're doing (since you didn't provide example code for the generator), so that limits what I can advise. If your parallel_for iterates a fixed number of times (by "fixed" I mean a count that is known immediately before parallel_for executes and doesn't change based on how many iterations have run), then you can just use a regular vector.
Simply size the vector up front with vector::resize (or construct it with the element count). This will value-initialize the elements, which means that every element exists. You can now perform your parallel_for loop, but instead of using push_back or the like, you simply write each element directly into its location in the output. I think parallel_for can iterate over the actual vector iterators, but I'm not positive. Either way, it doesn't matter; you won't get any race conditions unless you try to set the same element from different threads.
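A minimal sketch of that pattern using Microsoft's PPL (the Vertex struct and generate_vertex are stand-ins for your actual data and per-element computation):

```cpp
#include <ppl.h>
#include <vector>
#include <cstddef>

struct Vertex { float x, y, z; };

// Placeholder for the real per-element work done inside the loop.
Vertex generate_vertex(std::size_t i) {
    return Vertex{float(i), float(i) * 2.0f, 0.0f};
}

std::vector<Vertex> build_vertices(std::size_t count) {
    // Pre-size the vector so every element already exists and storage is contiguous.
    std::vector<Vertex> vertices(count);
    Concurrency::parallel_for(std::size_t(0), count, [&](std::size_t i) {
        vertices[i] = generate_vertex(i);  // each index is written by exactly one task
    });
    return vertices;
}

// Later, the contiguous data can be handed to OpenGL directly:
//   glVertexPointer(3, GL_FLOAT, 0, vertices.data());
```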

Mapping Untyped Lisp data into a typed binary format for use in compiled functions

Background: I'm writing a toy Lisp (Scheme) interpreter in Haskell. I'm at the point where I would like to be able to compile code using LLVM. I've spent a couple days dreaming up various ways of feeding untyped Lisp values into compiled functions that expect to know the format of the data coming at them. It occurs to me that I am not the first person to need to solve this problem.
Question: What are some historically successful ways of mapping untyped data into an efficient binary format?
Addendum: In point of fact, I do know which of about a dozen different types the data is, I just don't know which one might be sent to the function at compile time. The function itself needs a way to determine what it got.
Do you mean, "I just don't know which [type] might be sent to the function at runtime"? It's not that the data isn't typed; certainly 1 and '() have different types. Rather, the data is not statically typed, i.e., it's not known at compile time what the type of a given variable will be. This is called dynamic typing.
You're right that you're not the first person to need to solve this problem. The canonical solution is to tag each runtime value with its type. For example, if you have a dozen types, number them like so:
0 = integer
1 = cons pair
2 = vector
etc.
Once you've done this, reserve the first four bits of each word for the tag. Then, every time two objects get passed in to +, first you perform a simple bit mask to verify that both objects' first four bits are 0b0000, i.e., that they are both integers. If they are not, you jump to an error message; otherwise, you proceed with the addition, and make sure that the result is also tagged accordingly.
This technique essentially makes each runtime value a manually-tagged union, which should be familiar to you if you've used C. In fact, it's also just like a Haskell data type, except that in Haskell the taggedness is much more abstract.
I'm guessing that you're familiar with pointers if you're trying to write a Scheme compiler. To avoid limiting your usable memory space, it may make more sense to use the bottom (least significant) bits rather than the top ones. Better yet, because aligned pointers already have a few meaningless zero bits at the bottom (two for 4-byte alignment, three for 8-byte alignment), you can simply co-opt those bits for your tag, as long as you mask them off to recover the actual address before dereferencing.
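A small C++ sketch of that low-bit tagging variant (the tag values and the three-bit layout are illustrative assumptions, not prescribed by the answer):

```cpp
#include <cstdint>
#include <cassert>

// Three low bits are free on 8-byte-aligned pointers; use them for the tag.
enum Tag : std::uintptr_t { TAG_INT = 0, TAG_PAIR = 1, TAG_VECTOR = 2 };
constexpr std::uintptr_t TAG_MASK = 0x7;

// Small integers are stored immediately, shifted past the tag bits.
inline std::uintptr_t make_int(std::intptr_t v)   { return (std::uintptr_t(v) << 3) | TAG_INT; }
inline std::intptr_t  int_value(std::uintptr_t w) { return std::intptr_t(w) >> 3; }

// Heap objects keep their aligned address, with the tag in the low bits.
inline std::uintptr_t tag_ptr(void *p, Tag t)     { return std::uintptr_t(p) | t; }
inline void          *untag_ptr(std::uintptr_t w) { return reinterpret_cast<void *>(w & ~TAG_MASK); }
inline Tag            tag_of(std::uintptr_t w)    { return Tag(w & TAG_MASK); }

// '+' first checks that both operands carry the integer tag, then adds the
// untagged values and re-tags the result.
std::uintptr_t prim_add(std::uintptr_t a, std::uintptr_t b) {
    assert(tag_of(a) == TAG_INT && tag_of(b) == TAG_INT);  // real code would signal a type error
    return make_int(int_value(a) + int_value(b));
}
```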
Does that help?
Your default solution should be a simple tagged union. If you want to narrow your typing down to more specific types, you can do it - but it won't be that "toy" any more. A thing to look at is called abstract interpretation.
There are a few successful implementations of such an optimisation, with V8 probably being the most widespread. In the Scheme world, the most aggressively optimising implementation is Stalin.

Matlab: avoiding memory allocation in mex

I'm trying to make my mex library avoid all memory allocation whatsoever.
Until now, the mex function took an input, created some matrices using mxCreate...(), and returned those as output.
But now I'd like to modify this interface so that the mex itself would not do any allocations.
What I had in mind is that the mexFunction would receive as input the matrix to fill values into, and return this very same matrix as an output.
Is this supposed to be possible?
The slight alarm that got me wondering whether this is something I should be doing at all is that the right-hand arguments (the inputs) come to the mexFunction as const, while the left-hand arguments (the outputs) are non-const. To return the input matrix as an output, I'll need to cast away this const.
Funnily enough I was just looking at this the other day. The best info I found was threads here and here and also this.
Basically it is generally considered a very bad thing in the Matlab world... but at the same time, nothing stops you, so you can do it - try some simple examples and you will see that the changes are propagated. Just make changes to the data you get from prhs (you don't need to return anything - since you changed the raw data, it will be reflected in the variable in the workspace).
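A minimal sketch of what that looks like (the doubling operation is just an example; all the copy-on-write caveats discussed below apply):

```cpp
#include "mex.h"

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[]) {
    if (nrhs < 1 || !mxIsDouble(prhs[0]) || mxIsComplex(prhs[0])) {
        mexErrMsgTxt("Expected one real double matrix.");
    }

    // mxGetPr hands back a non-const pointer into the input's data area, so
    // writes here modify the caller's variable directly, bypassing
    // copy-on-write (any workspace variable sharing this data changes too).
    double *data = mxGetPr(prhs[0]);
    mwSize n = mxGetNumberOfElements(prhs[0]);
    for (mwSize i = 0; i < n; ++i) {
        data[i] *= 2.0;
    }

    // Nothing needs to be assigned to plhs; the caller's variable already
    // holds the modified values.
}
```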
However, as pointed out in the links, this can have strange consequences because of Matlab's copy-on-write semantics. Setting format debug can help a lot with getting intuition on this. If you do a = b then you will see that a and b have different 'structure addresses' or headers, representing the fact that they are different variables, but the data pointer, pr, points to the same area in memory. Normally, if you then change b in Matlab, copy-on-write kicks in and the data area is copied before being changed, so afterwards b has a new data pointer. When you change things in mex this doesn't happen, so if you changed b, a would also change.
I think it's OK to do it - it's incredibly useful if you need to handle large datasets, but you need to keep an eye out for any oddness - try to make sure the data you're putting in isn't shared among other variables. Things get even more complicated with struct and cell arrays, so I would be more inclined to avoid doing it to those.
Modifying the right-hand arguments would be a bad idea. Those inputs can be reference counted, and if you modify them when the reference count is greater than one, then you will be silently modifying the value stored in other variables as well.
Unfortunately, I don't believe there is a way to do what you want given the existing MEX API.
