Efficiency of appending to vectors

Appending an element onto a million-element ArrayList has the cost of setting one reference now, and copying one reference in the future when the ArrayList must be resized.
As I understand it, appending an element onto a million-element PersistentVector must create a new path, which consists of 4 arrays of size 32. That means more than 120 references have to be touched.
How does Clojure manage to keep the vector overhead to "2.5 times worse" or "4 times worse" (as opposed to "60 times worse"), as has been claimed in several Clojure videos I have seen recently? Has it something to do with caching or locality of reference or something I am not aware of?
Or is it somehow possible to build a vector internally with mutation and then turn it immutable before revealing it to the outside world?
I have tagged the question scala as well, since scala.collection.immutable.Vector is basically the same thing, right?

Clojure's PersistentVectors have a special tail buffer to enable efficient operation at the end of the vector. Only after this 32-element array is filled is it added to the rest of the tree. This keeps the amortized cost low. Here is one article on the implementation. The source is also worth a read.
Regarding "is it somehow possible to build a vector internally with mutation and then turn it immutable before revealing it to the outside world?": yes! These are known as transients in Clojure, and are used for efficient batch changes.

I can't speak for Clojure, but I can give some comments about Scala Vectors.
Persistent Scala vectors (scala.collection.immutable.Vector) are much slower than an array buffer when it comes to appending. In fact, they are 10x slower than the List prepend operation. They are 2x slower than appending to Conc-trees, which we use in Parallel Collections.
But Scala also has mutable vectors -- they're hidden in the class VectorBuilder. Appending to a mutable vector does not preserve the previous version of the vector, but mutates it in place by keeping a pointer to the rightmost leaf in the vector. So, yes -- keeping the vector mutable internally, and then returning an immutable reference, is exactly what's done in Scala collections.
The VectorBuilder is slightly faster than the ArrayBuffer, because it needs to allocate its arrays only once, whereas ArrayBuffer needs to do it twice on average (because of growing). Conc.Buffers, which we use as parallel array combiners, are twice as fast compared to VectorBuilders.
Benchmarks are here. None of the benchmarks involve any boxing; they work with reference objects to avoid any bias:
- comparison of Scala List, Vector and Conc
- comparison of Scala ArrayBuffer, VectorBuilder and Conc.Buffer
More collections benchmarks here.
These tests were executed using ScalaMeter.

Related

Mutable data types that use stack allocation

Based on my earlier question, I understand the benefit of using stack allocation. Suppose I have an array of arrays. For example, A is a list of matrices and each element A[i] is a 1x3 matrix. The length of A and the dimension of A[i] are known at run time (given by the user). Each A[i] is a matrix of Float64, and this is also known at run time. However, throughout the program, I will be modifying the values of A[i] element by element. What data structure would also allow me to use stack allocation? I tried StaticArrays, but it doesn't allow me to modify a static array.
StaticArrays defines MArray (MVector, MMatrix) types that are fixed-size and mutable. If you use these, there's a higher chance of the compiler determining that they can be stack-allocated, but it's not guaranteed. Moreover, since the pattern you're using is to pass the mutable state vector into a function which presumably modifies it, it's not going to be valid or helpful to stack allocate that anyway. If you're going to allocate state once and modify it throughout the program, it doesn't really matter whether it is heap or stack allocated: stack allocation is only a big win for objects that are allocated, used locally, and then don't escape the local scope, so they can be "freed" simply by popping the stack.
From the code snippet you showed in the linked question, the state vector is allocated in the outer function, test_for_loop, which shouldn't be a big deal since it's done once at the beginning of execution. Using a variably sized state vector to index into an array with a splat (...) might be an issue, however, and that's done in test_function. Using something with a fixed size like MVector might be better for that. It might, however, be better still to use a state tuple and return a new state tuple, rather than a mutated one, at the end (see the sketch below). The compiler is very good at turning that kind of thing into very efficient code because of immutability.
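The question is about Julia, but the value-semantics idea translates directly; a minimal sketch in C++, where std::array plays the role of the fixed-size, stack-friendly state (the step function is purely illustrative):

#include <array>

using State = std::array<double, 3>;   // fixed size, value semantics

// Return a fresh state instead of mutating the old one; since the value
// never escapes, the compiler can keep it on the stack or in registers.
State step(State s, double dt) {
    return { s[0] + dt * s[1], s[1] + dt * s[2], s[2] };
}

int main() {
    State s{1.0, 2.0, 3.0};
    for (int i = 0; i < 1000; ++i)
        s = step(s, 0.01);             // no heap allocation anywhere
}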
Note that by convention test_function should be called test_function! since it modifies its M argument and even more so if it modifies the state vector.
I would also note that this isn't a great question/answer pair since it's not standalone at all and really just a continuation of your other question. StackOverflow isn't very good for this kind of iterative question/discussion interaction, I'm afraid.

"cut and paste" the last k elements of std::vector efficiently?

Is it possible in C++11 to "cut and paste" the last k elements of an std::vector A to a new std::vector B in constant time?
One way would be to use B.insert(B.end(), A.end() - k, A.end()) and then call erase on A, but these are both O(k) time operations.
No, vectors own their memory.
This operation is known as splice. forward_list is ridiculously slow otherwise, but it does have an O(1) splice_after for single elements (splicing a range is still linear; std::list can additionally splice an entire list in O(1)).
Typically, the process of deciding which elements to move is already O(n), so having the splice itself take O(n) time is not a problem. The other operations being faster on vector more than make up for it.
This isn't possible in general, since (at least in the C++03 version -- there it's 23.2.4/1) the C++ standard guarantees that the memory used by a vector<T> is a single contiguous block. Thus the only way to "transfer" more than a fixed number of elements in O(1) time would be if the receiving vector were empty, and you had somehow arranged to have its allocated block of memory begin at the right place inside the first vector -- in which case the "transfer" could be argued to have taken no time at all. (Deliberately overlapping objects in this way is almost certain to constitute undefined behaviour in theory -- and in practice it's also very fragile, since any operation that invalidates iterators to a vector<T> can also reallocate memory, thus breaking things.)
If you're prepared to sacrifice a whole bunch of portability, I've heard it's possible to play OS-level (or hardware-level) tricks with virtual memory mapping to achieve things like no-overhead ring buffers. Maybe those tricks could also be applied here -- but bear in mind that the assumption that the mapping of virtual to physical memory within a single process is one-to-one is very deeply ingrained, so you could be in for some surprises.
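For completeness, a sketch of the practical transfer: O(k) rather than O(1), but using move iterators so the element payloads themselves are not copied, only the elements' internal buffers are stolen:

#include <iterator>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> A(1000000, "payload");
    std::vector<std::string> B;
    const std::size_t k = 1000;

    // Move (not copy) the last k elements into B, then drop them from A.
    // Both steps are O(k); the string payloads are not duplicated.
    B.insert(B.end(),
             std::make_move_iterator(A.end() - k),
             std::make_move_iterator(A.end()));
    A.erase(A.end() - k, A.end());
}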

IndexedSeq.last complexity

When working with indexed collections (most often immutable Vectors), I often use coll.last as what I supposed was a convenient shortcut for coll(coll.size-1). While randomly inspecting my sources, I clicked through to the implementation of last, and the IntelliJ IDE took me to TraversableLike.last, which traverses all elements to eventually reach the last one.
This was a surprise to me, and I am not sure now what the reason for this is. Is last really implemented this way? Is there some reason preventing last from being implemented efficiently for IndexedSeq (or perhaps for IndexedSeqLike)?
(Scala SDK used is 2.11.4)
IndexedSeq does not override last (it only inherits it from TraversableLike) - the fact that a particular sequence supports indexed access does not necessarily make indexed lookups faster than traversals. However, such optimized implementations are given in IndexedSeqOptimized, which I would expect many implementations to inherit from. In the specific case of Vector, last is overridden explicitly in the class itself.
IndexedSeq has constant access time for an arbitrary element. LinearSeq has linear time. TraversableLike is just the common interface, and you will find that last is overridden inside the IndexedSeqOptimized trait:
"A template trait for indexed sequences of type IndexedSeq[A] which optimizes the implementation of several methods under the assumption of fast random access."
def last: A = if (length > 0) this(length - 1) else super.last
You may also find the quick random-access implementation inside Vector.getElem - it uses a tree of arrays with a high branching factor, so apply is effectively O(1). Vector doesn't use IndexedSeqOptimized, but it has its own overridden last:
override /*TraversableLike*/ def last: A = {
  if (isEmpty) throw new UnsupportedOperationException("empty.last")
  apply(length - 1)
}
So it's a bit of a mess inside the Scala collections, which is very common for Scala internals. Anyway, last on IndexedSeqs is O(1) de facto, regardless of the tricky collections architecture.
The intricacy of the Scala collections is actually an active topic. A talk (and slides) with criticism of Scala's collection framework can be found at Paul Phillips: Scala Collections: Why Not?, and Paul Phillips is developing his alternative version of the standard library.

Overhead of std::optional<T>?

Now that std::experimental::optional has been accepted (or is about to be accepted), I wonder what the overhead is, and what the consequences on the generated assembly are, when the inner value is accessed via the following operators:
->
*
value
value_or
compared to the case without std::optional. This could be particularly important for computationally intensive programs.
For example, what would be the order of magnitude of the overhead of operations on a std::vector<std::experimental::optional<double>> compared to a std::vector<double>?
-> and * ought to have zero overhead.
value and value_or ought to have the overhead of one branch: if(active)
Also, copy/move constructor, copy/move assignment, swap, emplace, operator==, operator<, and the destructor ought to also have the overhead of one branch.
However, one branch of overhead is so small it probably can't even be measured. Seriously, write pretty code, and don't worry about the performance here. Odds are that making the code pretty will result in it running faster than if you tried to make it fast. Counter-intuitive, but do it anyway.
There are definitely cases where the overhead becomes noticeable, for instance sorting a large number of optionals. In these cases, there are four situations (case (C) is sketched after the list):
(A) All the optionals are known to be empty ahead of time, in which case, why sort?
(B) Some optionals may or may not be active, in which case the overhead is required and there is no better way.
(C) All optionals are known to have values ahead of time and you don't need the sorted data in place, in which case use the zero-overhead operators to make a copy of the data where the copy uses the raw type instead of optional, and sort that.
(D) All optionals are known to have values ahead of time, but you need the sorted data in place. In this case, optional is adding unnecessary overhead, and the easiest way to work around it is to do step (C), and then use the no-overhead operators to move the data back.
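A minimal sketch of case (C), with the continuation for case (D), assuming every optional is known to be engaged (written against C++17's std::optional, which the experimental version became):

#include <algorithm>
#include <optional>
#include <vector>

int main() {
    std::vector<std::optional<double>> data = { 3.0, 1.0, 2.0 };

    // Case (C): every optional is known to hold a value, so copy the raw
    // doubles out with the zero-overhead operator* and sort the copy.
    std::vector<double> raw;
    raw.reserve(data.size());
    for (const auto& o : data)
        raw.push_back(*o);          // no branch: o is known to be engaged

    std::sort(raw.begin(), raw.end());

    // Case (D) continues by moving the sorted values back in place.
    for (std::size_t i = 0; i < raw.size(); ++i)
        data[i] = raw[i];
}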
Besides the other answer, you should also consider that std::optional requires additional memory.
Often it's not just an extra byte, but (at least for "small" types) a 2x space overhead due to padding.
Maybe RAM isn't a problem, but it also means fewer values fit in the cache.
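For instance, on a typical 64-bit implementation the flag plus alignment padding doubles the footprint of a double (exact sizes are implementation-defined, so this is an illustration, not a guarantee):

#include <cstdio>
#include <optional>

int main() {
    // double is 8 bytes; optional<double> adds a bool flag, and alignment
    // typically pads the pair out to 16 bytes on 64-bit ABIs.
    std::printf("sizeof(double)                = %zu\n", sizeof(double));
    std::printf("sizeof(std::optional<double>) = %zu\n",
                sizeof(std::optional<double>));
}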
A sentinel value, if domain knowledge allows one to be used, could be a better choice (probably in the form of markable to keep type safety), as sketched below.
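A hand-rolled version of the sentinel idea for double, using NaN; this is the pattern that markable wraps with type safety (the helper names are illustrative):

#include <cmath>
#include <limits>
#include <vector>

// NaN marks "no value", so each element stays exactly sizeof(double).
inline double no_value() { return std::numeric_limits<double>::quiet_NaN(); }
inline bool has_value(double d) { return !std::isnan(d); }

int main() {
    std::vector<double> data = { 1.0, no_value(), 3.0 };
    double sum = 0.0;
    for (double d : data)
        if (has_value(d))           // one branch, as with optional, but
            sum += d;               // half the memory traffic
}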
An interesting reading is: Boost optional - Performance considerations

vector --> concurrent_vector migration + OpenGL restriction

I need to speed up some calculation, and the result of the calculation is then used to draw an OpenGL model.
A major speed-up was achieved when I changed std::vector to Concurrency::concurrent_vector and used parallel_for instead of plain for loops.
This vector (or concurrent_vector) is filled in a for (or parallel_for) loop and contains the vertices for OpenGL to visualize.
Using std::vector is fine, because the OpenGL rendering procedure relies on the fact that std::vector keeps its items contiguous, which is not the case with concurrent_vector. The code runs something like this:
glVertexPointer(3, GL_FLOAT, 0, &vectorWithVerticesData[0]);
Generating a concurrent_vector and copying it into a std::vector is too expensive, since there are a lot of items.
So, the question is: I'd like to use OpenGL vertex arrays, but I'd also like to use concurrent_vector, which is incompatible with that kind of OpenGL output.
Any suggestions?
You're trying to use a data structure that doesn't store its elements contiguously in an API that requires contiguous storage. Well, one of those has to give, and it's not going to be OpenGL. GL isn't going to walk concurrent_vector's data structure (not if you like performance).
So your only real option is to not use non-contiguous containers.
I can only guess at what you're doing (since you didn't provide example code for the generator), so that limits what I can advise. If your parallel_for iterates a fixed number of times ("fixed" meaning a value that is known immediately before parallel_for executes and doesn't change based on how many times you've iterated), then you can just use a regular vector.
Simply pre-size the vector with vector::resize. This will value-initialize the elements, which means that every element exists. You can now perform your parallel_for loop, but instead of using push_back or the like, you simply copy each element directly into its location in the output. I think parallel_for can iterate over the actual vector iterators, but I'm not positive. Either way, it doesn't matter; you won't get any race conditions unless you try to set the same element from different threads. A sketch follows.
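A minimal sketch of that approach with the PPL; computeVertexComponent is a stand-in for whatever the real per-vertex calculation is:

#include <ppl.h>
#include <vector>

// Stand-in for the real calculation (purely illustrative).
float computeVertexComponent(int i) { return i * 0.5f; }

void buildVertices(std::vector<float>& vertices, int vertexCount) {
    // Pre-size once: every element now exists, so each iteration can
    // write directly to its own slot without push_back or locking.
    vertices.resize(vertexCount * 3);

    concurrency::parallel_for(0, vertexCount * 3, [&](int i) {
        vertices[i] = computeVertexComponent(i);  // unique index: no race
    });
}

// The contiguous buffer can then be handed to OpenGL as before:
// glVertexPointer(3, GL_FLOAT, 0, &vertices[0]);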
