vector --> concurrent_vector migration + OpenGL restriction - windows

I need to speed up a calculation whose result is then used to draw an OpenGL model.
I achieved a major speed-up when I changed std::vector to Concurrency::concurrent_vector and used parallel_for instead of plain for loops.
This vector (or concurrent_vector) is filled in a for (or parallel_for) loop and contains the vertices for OpenGL to visualize.
Using std::vector works because the OpenGL rendering procedure relies on the fact that std::vector keeps its items contiguous in memory, which is not the case with concurrent_vector. The code looks something like this:
glVertexPointer(3, GL_FLOAT, 0, &vectorWithVerticesData[0]);
Generating a concurrent_vector and then copying it to a std::vector is too expensive since there are a lot of items.
So, the question is: I'd like to use OpenGL arrays, but I'd also like to use concurrent_vector, which is incompatible with OpenGL output.
Any suggestions?

You're trying to use a data structure that doesn't store its elements contiguously in an API that requires contiguous storage. Well, one of those has to give, and it's not going to be OpenGL. GL isn't going to walk concurrent_vector's data structure (not if you like performance).
So your only option is to not use a container with non-contiguous storage.
I can only guess at what you're doing (since you didn't provide example code for the generator), so that limits what I can advise. If your parallel_for iterates a fixed number of times, then you can just use a regular vector. By "fixed", I mean a count that is known immediately before parallel_for executes and that doesn't change based on how many times you've iterated.
Simply size the vector with vector::resize. This will value-initialize the elements, which means that every element exists. You can now perform your parallel_for loop, but instead of using push_back or whatever, you simply write each element directly into its location in the output. I think parallel_for can iterate over the actual vector iterators, but I'm not positive. Either way, it doesn't matter; you won't get any race conditions unless you try to set the same element from different threads.
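A minimal sketch of that approach, assuming the vertex count is known before the loop starts. It uses std::thread as a portable stand-in for parallel_for, and compute_vertex and build_vertices are hypothetical names standing in for the real calculation:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the real per-vertex calculation.
inline void compute_vertex(std::size_t i, float* xyz) {
    xyz[0] = static_cast<float>(i);
    xyz[1] = static_cast<float>(i) * 0.5f;
    xyz[2] = 0.0f;
}

// Fills a contiguous buffer holding 3 floats per vertex. Each worker writes
// a disjoint index range, so no two threads touch the same element.
inline std::vector<float> build_vertices(std::size_t vertex_count,
                                         unsigned worker_count = 4) {
    std::vector<float> vertices;
    vertices.resize(vertex_count * 3);  // every element exists before the loop

    worker_count = std::max(1u, worker_count);
    const std::size_t chunk = (vertex_count + worker_count - 1) / worker_count;
    float* out = vertices.data();

    std::vector<std::thread> workers;
    for (unsigned w = 0; w < worker_count; ++w) {
        const std::size_t begin = std::min(vertex_count, w * chunk);
        const std::size_t end = std::min(vertex_count, begin + chunk);
        if (begin == end) continue;  // nothing left for this worker
        workers.emplace_back([begin, end, out] {
            for (std::size_t i = begin; i < end; ++i) {
                compute_vertex(i, out + i * 3);  // write in place, no push_back
            }
        });
    }
    for (std::thread& t : workers) t.join();
    return vertices;
}
```

The result can be handed to OpenGL as before, e.g. glVertexPointer(3, GL_FLOAT, 0, vertices.data()). With the PPL, the worker loop would presumably collapse to a single parallel_for over the index range [0, vertex_count) with the same body.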

Related

Mutable data types that use stack allocation

Based on my earlier question, I understand the benefit of using stack allocation. Suppose I have an array of arrays. For example, A is a list of matrices and each element A[i] is a 1x3 matrix. The length of A and the dimension of A[i] are known at run time (given by the user). Each A[i] is a matrix of Float64 and this is also known at run time. However, throughout the program, I will be modifying the values of A[i] element by element. What data structure would also allow me to use stack allocation? I tried StaticArrays but it doesn't allow me to modify a static array.
StaticArrays defines MArray (MVector, MMatrix) types that are fixed-size and mutable. If you use these there's a higher chance of the compiler determining that they can be stack-allocated, but it's not guaranteed. Moreover, since the pattern you're using is that you're passing the mutable state vector into a function which presumably modifies it, it's not going to be valid or helpful to stack allocate that anyway. If you're going to allocate state once and modify it throughout the program, it doesn't really matter if it is heap or stack allocated—stack allocation is only a big win for objects that are allocated, used locally and then don't escape the local scope, so they can be “freed” simply by popping the stack.
From the code snippet you showed in the linked question, the state vector is allocated in the outer function, test_for_loop, which shouldn't be a big deal since it's done once at the beginning of execution. Using a variably sized state vector to index into an array with a splat (...) might be an issue, however, and that's done in test_function. Using something with fixed size like MVector might be better for that. It might, however, be better still, to use a state tuple and return a new rather than mutated state tuple at the end. The compiler is very good at turning that kind of thing into very efficient code because of immutability.
Note that by convention test_function should be called test_function! since it modifies its M argument and even more so if it modifies the state vector.
I would also note that this isn't a great question/answer pair since it's not standalone at all and really just a continuation of your other question. StackOverflow isn't very good for this kind of iterative question/discussion interaction, I'm afraid.

Using unordered_map with key only to store pointers (dismiss value)

I'm implementing an algorithm that checks nodes in a mesh for a certain value. To store information on which node I have already checked I'd like to use an unordered_map with the pointer to the node as a key. I can then simply use umap.find(pointer) to see if the node was already checked and skip it. This way I can accomplish it in O(n) time.
However, I don't actually need to store a value in the map. The key itself is enough information. Is std::unordered_map even the right solution then? If so, what should I put in the "value" field to maximize performance? I have a 32-bit embedded system, so I thought of just putting uint32_t or uint_fast32_t there.
tl;dr:
Is std::unordered_map the right tool to store keys without values?
Will the native hash function work well for pointers? Or would you suggest a different hashing algorithm?
What do I put as "value" for the map if using std::unordered_map to optimize for performance?
Is std::unordered_map the right tool to store keys without values?
I would use a std::unordered_set in these situations.
Will the native hash function work well for pointers?
Yes. It is most likely just a cast from pointer to std::size_t.
What do I put as "value" for the map if using std::unordered_map to optimize for performance?
If you use a std::unordered_set instead, there is no value, only the pointers.
Is std::unordered_map the right tool to store keys without values?
No - std::unordered_set is the one to use when you don't have distinct keys and values.
Will the native hash function work well for pointers? Or would you suggest a different hashing algorithm?
The "native" compiler-supplied hash function probably casts the pointer value to size_t - a kind of identity hash. That may or may not work well depending on the compromises your Standard Library has chosen. GCC and clang use prime numbers of buckets in the hash table, so it will work fine. Visual C++ (and many non-Standard hash table implementations) use powers of two (i.e. 128, 256, 512...). Powers of two are used because it's very fast to map hash values on to buckets - just AND with a bitwise mask (127, 255, 511) to retain however many less-significant bits you need. The problem with doing that with pointers is that the pointed-to objects often have some alignment, so the addresses may all be multiples of e.g. 4 or 8. A multiple of 8 always has the three least significant bits set to 0: those bits don't contribute to the randomised placement of the value in a bucket. Instead, only every 8th bucket will receive any share of the elements being hashed. If you have an implementation like this, then you're probably better off using a better hash function. At the least, you could, say, bit-shift the pointer values right by enough to remove the known zeros.
What do I put as "value" for the map if using std::unordered_map to optimize for performance?
Again, you should use a std::unordered_set, so you don't have to worry about a value.
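Putting both answers together, a sketch might look like the following. Node, NodePtrHash, VisitedSet, low_zero_bits and mark_visited are hypothetical names, and the right shift assumes nodes are aligned to alignof(Node); whether a custom hash is needed at all depends on your Standard Library's bucket strategy:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>

// Hypothetical mesh node; only its alignment matters for the hash.
struct Node {
    float x, y, z;
};

// Number of low bits that are always zero in a pointer aligned to
// `alignment` (which is assumed to be a power of two).
constexpr unsigned low_zero_bits(std::size_t alignment) {
    unsigned bits = 0;
    while (alignment > 1) {
        alignment >>= 1;
        ++bits;
    }
    return bits;
}

// Identity-style hash with the known-zero alignment bits shifted out.
struct NodePtrHash {
    std::size_t operator()(const Node* p) const noexcept {
        const auto address = reinterpret_cast<std::uintptr_t>(p);
        return static_cast<std::size_t>(address >> low_zero_bits(alignof(Node)));
    }
};

using VisitedSet = std::unordered_set<const Node*, NodePtrHash>;

// Returns true the first time a node is seen and false on later visits.
inline bool mark_visited(VisitedSet& visited, const Node* node) {
    return visited.insert(node).second;
}
```

Using insert and checking the returned bool does the lookup and the insertion in one step, so a separate find call is not needed.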

Halide: Filter elements out of vector (Halide::Runtime::Buffer)

I have a Halide::Runtime::Buffer and would like to remove elements that match a criterion, ideally such that the operation occurs in-place and that the function can be defined in a Halide::Generator.
I have looked into using reductions, but it seems to me that I cannot output a vector of a different length -- I can only set certain elements to a value of my choice.
So far, the only way I got it to work was by using an extern "C" call and passing the Buffer I wanted to filter, along with a boolean Buffer (1's and 0's as ints). I read the Buffers into vectors of another library (Armadillo), applied my desired filter, then read the filtered vector back into Halide.
This seems quite messy and also, with this code, I'm passing a Halide::Buffer object, and not a Halide::Runtime::Buffer object, so I don't know how to implement this within a Halide::Generator.
So my question is twofold:
Can this kind of filtering be achieved in pure Halide, preferably in-place?
Is there an example of using extern "C" functions within Generators?
The first part is effectively stream compaction. It can be done in Halide, though the output size will either need to be fixed or be a function of the input size (e.g. the same size as the input). One can also get the max index produced as an output to indicate how many results were produced. I wrote up a bit of an answer on how to do a prefix-sum-based stream compaction here: "Halide: Reduction over a domain for the specific values". It is an open question how to do this most efficiently in parallel across a variety of targets, and we hope to do some work on exploring that space soon.
Whether this is in-place or not depends on whether one can put everything into a single series of update definitions for a Func. E.g. It cannot be done in-place on an input passed into a Halide filter because reductions always allocate a buffer to work on. It may be possible to do so if the input is produced inside the Generator.
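To make the idea concrete, here is a sketch of prefix-sum-based stream compaction in plain C++. This is not Halide code, and Compacted and compact are hypothetical names. As described above, the output has the same size as the input, and a separate count says how many leading entries are valid:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Compacted {
    std::vector<int> values;  // same length as the input; only [0, count) is meaningful
    std::size_t count;        // number of elements that passed the filter
};

inline Compacted compact(const std::vector<int>& input,
                         const std::vector<int>& keep) {
    assert(input.size() == keep.size());
    const std::size_t n = input.size();

    // Exclusive prefix sum over the keep flags: slot[i] is the output index
    // that element i will occupy if it is kept.
    std::vector<std::size_t> slot(n);
    std::size_t running = 0;
    for (std::size_t i = 0; i < n; ++i) {
        slot[i] = running;
        running += (keep[i] != 0) ? 1 : 0;
    }

    // Scatter the kept elements to their slots. Every kept element has a
    // unique slot, so these writes are independent of each other.
    Compacted out{std::vector<int>(n, 0), running};
    for (std::size_t i = 0; i < n; ++i) {
        if (keep[i] != 0) out.values[slot[i]] = input[i];
    }
    return out;
}
```

For example, compacting {5, 8, 2, 9, 4} with flags {1, 0, 1, 1, 0} yields a count of 3 with 5, 2, 9 in the first three slots.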
Re: the second question, are you using define_extern? This is not super well integrated with Halide::Runtime::Buffer, as the external function must be implemented with halide_buffer_t, but it is fairly straightforward to access from within a Generator. We don't have a tutorial on this yet, but there are a number of examples in the tests. E.g.:
https://github.com/halide/Halide/blob/master/test/generator/define_extern_opencl_generator.cpp#L19
and the definition:
https://github.com/halide/Halide/blob/master/test/generator/define_extern_opencl_aottest.cpp#L119
(These do not need to be extern "C" as I implemented C++ name mangling a while back. Just set the name mangling parameter to define_extern to NameMangling::CPlusPlus and remove the extern "C" from the external function's declaration. This is very useful as it gets one link time type checking on the external function, which catches a moderately frequent class of errors.)

Partially std::move a vector? Or how to split without new memory allocation?

When we move std::vector we just steal its content. So this code:
std::vector<MyClass> v{ std::move(tmpVec) };
will not allocate new memory, will not call any of constructors of MyClass.
But what if I want to split a temporary vector? In theory, I could steal the content as we did before and distribute it among new vectors. In practice I can't do this. The best solution I have found so far is to use std::move() from the <algorithm> header. But then operator new will be called for every new vector. Additionally, the move constructor (if available) will be called for every element we move.
What else can I do (C++17 counts)?
In theory, I could steal the content as we did before and distribute it among new vectors.
No, you cannot.
A memory allocation cannot be broken up into multiple memory allocations. At least, not without doing multiple memory allocations, then copying/moving the elements from the original into those separate pieces.
You cannot create separate vectors that have different storage without actually copying/moving the elements to those different memory buffers. You can of course take separate ranges of that vector and do whatever you can with such ranges (iterator/pointer pairs, gsl::span, etc). But each range would always be referencing elements ultimately owned by the source vector; they cannot independently own subranges of a vector.
You can write a span class that stores two pointers, and does not own the data between them. It can have many vector-like operations on it.
It should also support slicing itself (without allocation) into sub components.
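A minimal sketch of such a non-owning span for C++17, where std::span is not yet available. The name span_view and its split_at member are assumptions made for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Non-owning view over a contiguous range, stored as two pointers.
template <typename T>
class span_view {
public:
    span_view(T* first, T* last) : first_(first), last_(last) {}
    explicit span_view(std::vector<T>& v)
        : first_(v.data()), last_(v.data() + v.size()) {}

    T* begin() const { return first_; }
    T* end() const { return last_; }
    std::size_t size() const { return static_cast<std::size_t>(last_ - first_); }
    T& operator[](std::size_t i) const { return first_[i]; }

    // Both halves refer to the same storage; nothing is copied or allocated.
    std::pair<span_view, span_view> split_at(std::size_t n) const {
        assert(n <= size());
        return {span_view(first_, first_ + n), span_view(first_ + n, last_)};
    }

private:
    T* first_;
    T* last_;
};
```

The views are only valid while the source vector is alive and is not reallocated, since the vector still owns the elements.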
You can write a shared_span class that has both those two pointers and a shared_ptr which represents (possibly shared) ownership of the underlying buffer. It should support the operations of span, except that functions returning span (like without_front(std::size_t count=1)) should instead return shared_span (with shared ownership).
You can write a move constructor from vector to shared_span easily. You may even be able to write a function from shared_span to vector with a special allocator that doesn't allocate until it grows. Making that fully portable would be very difficult.
If it is possible (I am uncertain), you could take a std::vector, move its storage into a shared_ptr<std::vector>, feed that to an allocator, build two std::vector<T, special_allocator>s that use that memory, and do what you want.
But you could just replace your request for vectors that do this with code that consumes a shared_span. shared_span could even have a concept of extra "dead" memory before/after the buffer it is using, giving it performance approaching std::vector.
There is a span in the gsl library you could possibly use. I am unaware of a publicly available shared_span.
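One possible shape for such a shared_span, as a sketch only. It assumes the buffer is kept alive by a shared_ptr to the moved-from vector, and split_at is a hypothetical addition next to without_front. Constructing it costs one small allocation for the shared owner, but the elements themselves are neither moved nor reallocated:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

template <typename T>
class shared_span {
public:
    // Takes over the vector's buffer: move-constructing a std::vector
    // transfers its storage without touching the individual elements.
    explicit shared_span(std::vector<T>&& v)
        : owner_(std::make_shared<std::vector<T>>(std::move(v))),
          first_(owner_->data()),
          last_(owner_->data() + owner_->size()) {}

    T* begin() const { return first_; }
    T* end() const { return last_; }
    std::size_t size() const { return static_cast<std::size_t>(last_ - first_); }
    T& operator[](std::size_t i) const { return first_[i]; }

    // Sub-spans share ownership, so the buffer lives as long as any of them.
    shared_span without_front(std::size_t count = 1) const {
        assert(count <= size());
        return shared_span(owner_, first_ + count, last_);
    }

    std::pair<shared_span, shared_span> split_at(std::size_t n) const {
        assert(n <= size());
        return {shared_span(owner_, first_, first_ + n),
                shared_span(owner_, first_ + n, last_)};
    }

private:
    shared_span(std::shared_ptr<std::vector<T>> owner, T* first, T* last)
        : owner_(std::move(owner)), first_(first), last_(last) {}

    std::shared_ptr<std::vector<T>> owner_;  // declared first: initialized first
    T* first_;
    T* last_;
};
```

Each piece returned by split_at can be passed around and destroyed independently; the underlying buffer is freed when the last piece goes away.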

Efficiency of appending to vectors

Appending an element onto a million-element ArrayList has the cost of setting one reference now, and copying one reference in the future when the ArrayList must be resized.
As I understand it, appending an element onto a million-element PersistentVector must create a new path, which consists of 4 arrays of size 32. That means more than 120 references have to be touched.
How does Clojure manage to keep the vector overhead to "2.5 times worse" or "4 times worse" (as opposed to "60 times worse"), which has been claimed in several Clojure videos I have seen recently? Has it something to do with caching or locality of reference or something I am not aware of?
Or is it somehow possible to build a vector internally with mutation and then turn it immutable before revealing it to the outside world?
I have tagged the question scala as well, since scala.collection.immutable.Vector is basically the same thing, right?
Clojure's PersistentVectors have a special tail buffer to enable efficient operations at the end of the vector. Only after this 32-element array is filled is it added to the rest of the tree. This keeps the amortized cost low. Here is one article on the implementation. The source is also worth a read.
Regarding, "is it somehow possible to build a vector internally with mutation and then turn it immutable before revealing it to the outside world?", yes! These are known as transients in Clojure, and are used for efficient batch changes.
I can't speak for Clojure, but I can give some comments about Scala Vectors.
Persistent Scala vectors (scala.collection.immutable.Vectors) are much slower than an array buffer when it comes to appending. In fact, they are 10x slower than the List prepend operation. They are 2x slower than appending to Conc-trees, which we use in Parallel Collections.
But Scala also has mutable vectors -- they're hidden in the class VectorBuilder. Appending to a mutable vector does not preserve the previous version of the vector, but mutates it in place by keeping a pointer to the rightmost leaf in the vector. So, yes -- keeping the vector mutable internally, and then returning an immutable reference, is exactly what's done in Scala collections.
The VectorBuilder is slightly faster than the ArrayBuffer, because it needs to allocate its arrays only once, whereas ArrayBuffer needs to do it twice on average (because of growing). Conc.Buffers, which we use as parallel array combiners, are twice as fast compared to VectorBuilders.
Benchmarks are here. None of the benchmarks involve any boxing; they work with reference objects to avoid any bias:
comparison of Scala List, Vector and Conc
comparison of Scala ArrayBuffer, VectorBuilder and Conc.Buffer
More collections benchmarks here.
These tests were executed using ScalaMeter.

Resources