"cut and paste" the last k elements of std::vector efficiently? - algorithm

Is it possible in C++11 "cut and paste" the last k elements of an std::vector A to a new std:::vector B in constant time?
One way would be to use B.insert(A.end() - k, A.end()) and then use erase on A but these are both O(k) time operations.
Mau

No, vectors own their memory.
This operation is known as splice. forward_list is ridiculously slow otherwise, but it does have an O(1) splice.
Typically, the process of deciding which elements to move is already O(n), so having the splice itself take O(n) time is not a problem. The other operations being faster on vector more than make up for it.

This isn't possible in general, since (at least in the C++03 version -- there it's 23.2.4/1) the C++ standard guarantees that the memory used by a vector<T> is a single contiguous block. Thus the only way to "transfer" more than a fixed number of elements in O(1) time would be if the receiving vector were empty, and you had somehow arranged to have it's allocated block of memory begin at the right place inside the first vector -- in which case the "transfer" could be argued to have taken no time at all. (Deliberately overlapping objects in this way is almost certain to constitute Undefined Behaviour in theory -- and in practice, it's also very fragile, since any operation that invalidates iterators to a vector<T> can also reallocate memory, thus breaking things.)
If you're prepared to sacrifice a whole bunch of portability, I've heard it's possible to play OS-level (or hardware-level) tricks with virtual memory mapping to achieve tricks like no-overhead ring buffers. Maybe these tricks could also be applied here -- but bear in mind that the assumption that the mapping of virtual to physical memory within a single process is one-to-one is very deeply ingrained, so you could be in for some surprises.

Related

Space complexity of reassigning an array

What is the space complexity of the following Java code?
public int[] foo(int[] x) {
x = new int[x.length];
// Do stuff with x that does not require additional memory
return x
}
Is it O(1) or O(N)? I've seen both answers. But I can't understand how it could be O(1). I would guess that it's O(N). We create a new array of the same size while the original array might still exist. Thus the original array is not replaced, i.e. we allocated additional storage space that increases linear with the length N of the input array. Am I correct?
The semantic of this piece of code is unsure, as the language is not specified. In any case, O(1) isn't possible because one allocates a new array at the same time that the original exists. (In a garbage collected language, one could imagine, with a lot of bad faith, that x is deallocated then immediately reallocated at the same place.)
O(N) with details that are highly dependent on various external factors such as your operating system.
A naive implementation requires actually zeroing out memory and assigning it. If the space requested is small, and particularly if you've already freed a good place to put it, this is probably what is going to happen. That operation is O(N).
If the space requested is large, you're probably just going to set up page table entries and NOT allocate any space. This is again O(N), but with extremely good constants. As you use the memory it actually has to get assigned, which is no faster than doing it up front. (It is actually slower.) But, in the meantime, being slow to use up memory is good for reducing contention on RAM.

Mutable data types that use stack allocation

Based on my earlier question, I understand the benefit of using stack allocation. Suppose I have an array of arrays. For example, A is a list of matrices and each element A[i] is a 1x3 matrix. The length of A and the dimension of A[i] are known at run time (given by the user). Each A[i] is a matrix of Float64 and this is also known at run time. However, through out the program, I will be modifying the values of A[i] element by element. What data structure can also allow me to use stack allocation? I tried StaticArrays but it doesn't allow me to modify a static array.
StaticArrays defines MArray (MVector, MMatrix) types that are fixed-size and mutable. If you use these there's a higher chance of the compiler determining that they can be stack-allocated, but it's not guaranteed. Moreover, since the pattern you're using is that you're passing the mutable state vector into a function which presumably modifies it, it's not going to be valid or helpful to stack allocate that anyway. If you're going to allocate state once and modify it throughout the program, it doesn't really matter if it is heap or stack allocated—stack allocation is only a big win for objects that are allocated, used locally and then don't escape the local scope, so they can be “freed” simply by popping the stack.
From the code snippet you showed in the linked question, the state vector is allocated in the outer function, test_for_loop, which shouldn't be a big deal since it's done once at the beginning of execution. Using a variably sized state vector to index into an array with a splat (...) might be an issue, however, and that's done in test_function. Using something with fixed size like MVector might be better for that. It might, however, be better still, to use a state tuple and return a new rather than mutated state tuple at the end. The compiler is very good at turning that kind of thing into very efficient code because of immutability.
Note that by convention test_function should be called test_function! since it modifies its M argument and even more so if it modifies the state vector.
I would also note that this isn't a great question/answer pair since it's not standalone at all and really just a continuation of your other question. StackOverflow isn't very good for this kind of iterative question/discussion interaction, I'm afraid.

Efficiency of appending to vectors

Appending an element onto a million-element ArrayList has the cost of setting one reference now, and copying one reference in the future when the ArrayList must be resized.
As I understand it, appending an element onto a million-element PersistenVector must create a new path, which consists of 4 arrays of size 32. Which means more than 120 references have to be touched.
How does Clojure manage to keep the vector overhead to "2.5 times worse" or "4 times worse" (as opposed to "60 times worse"), which has been claimed in several Clojure videos I have seen recently? Has it something to do with caching or locality of reference or something I am not aware of?
Or is it somehow possible to build a vector internally with mutation and then turn it immutable before revealing it to the outside world?
I have tagged the question scala as well, since scala.collection.immutable.vector is basically the same thing, right?
Clojure's PersistentVector's have special tail buffer to enable efficient operation at the end of the vector. Only after this 32-element array is filled is it added to the rest of the tree. This keeps the amortized cost low. Here is one article on the implementation. The source is also worth a read.
Regarding, "is it somehow possible to build a vector internally with mutation and then turn it immutable before revealing it to the outside world?", yes! These are known as transients in Clojure, and are used for efficient batch changes.
Cannot tell about Clojure, but I can give some comments about Scala Vectors.
Persistent Scala vectors (scala.collection.immutable.Vectors) are much slower than an array buffer when it comes to appending. In fact, they are 10x slower than the List prepend operation. They are 2x slower than appending to Conc-trees, which we use in Parallel Collections.
But, Scala also has mutable vectors -- they're hidden in the class VectorBuilder. Appending to mutable vectors does not preserve the previous version of the vector, but mutates it in place by keeping the pointer to the rightmost leaf in the vector. So, yes -- keeping the vector mutable internally, and than returning an immutable reference is exactly what's done in Scala collections.
The VectorBuilder is slightly faster than the ArrayBuffer, because it needs to allocate its arrays only once, whereas ArrayBuffer needs to do it twice on average (because of growing). Conc.Buffers, which we use as parallel array combiners, are twice as fast compared to VectorBuilders.
Benchmarks are here. None of the benchmarks involve any boxing, they work with reference objects to avoid any bias:
comparison of Scala List, Vector and Conc
comparison of Scala ArrayBuffer, VectorBuilder and Conc.Buffer
More collections benchmarks here.
These tests were executed using ScalaMeter.

Pseudo Least Recently Used Binary Tree

The logic behind Pseudo LRU is to use less bits and to speed up the replacement of the block. The logic is given as "let 1 represent that the left side has been referenced more recently than the right side, and 0 vice-versa"
But I am unable to understand the implementation given in the following diagram:
Details are given at : http://courses.cse.tamu.edu/ejkim/614/CSCE614-2011c-HW4-tutorial.pptx
I'm also studying about Pseudo-LRU.
Here is my understand. Hope it's helpful.
"Hit CL1": there's a referent to CL1, and hit
LRU state (B0 and B1) are changed to inform CL1 is recently referred.
"Hit CL0": there's a referent to CL0, and hit
LRU state (B1) is updated to inform that CL0 is recently used (than CL1)
"Miss; CL2 replace"
There's a miss, and LRU is requested for replacement index.
As current state, CL2 is chose.
LRU state (B0 and B2) are updated to inform CL2 is recently used.
(it's also cause next replacement will be CL1)
I know there is already an answer which clearly explains the photo, but I wanted to post my way of thinking to implement a fast pseudo-LRU algorithm and it's advantages over classic LRU.
From the memory point of view, if there are N objects(pointers, 32/64 bit values) you need N-1 flag bits and a HashMap to store the information of objects( pointer to actual address and position in the array) for querying if an elements exists already in cache. It doesn't use less memory than classic LRU, actually it uses N-1 auxiliar bits.
The optimization comes from cpu time. Comparing some flags takes really no time because they are bits. In classic LRU you must have some sort of structure which permits insertion/deletion and you can take the LRU fast(maybe heap). This structure takes O(log(N)) for a usual operation, but also the comparison between values is expensive. So in the end you end up with O(log(N)^2) complexity per operation, instead of O(log(N)) for Pseudo-LRU.
Even if Pseudo-LRU doesn't always take the LRU object out when there is a cache miss, in practice it seems that it behaves pretty good and it's not a major drawback.

Purpose of Xor Linked List?

I stumbled on a Wikipedia article about the Xor linked list, and it's an interesting bit of trivia, but the article seems to imply that it occasionally gets used in real code. I'm curious what it's good for, since I don't see why it makes sense even in severely memory/cache constrained systems:
The main advantage of a regular doubly linked list is the ability to insert or delete elements from the middle of the list given only a pointer to the node to be deleted/inserted after. This can't be done with an xor-linked list.
If one wants O(1) prepending or O(1) insertion/removal while iterating then singly linked lists are good enough and avoid a lot of code complexity and caveats with respect to garbage collection, debugging tools, etc.
If one doesn't need O(1) prepending/insertion/removal then using an array is probably more efficient in both space and time. Even if one only needs efficient insertion/removal while iterating, arrays can be pretty good since the insertion/removal can be done while iterating.
Given the above, what's the point? Are there any weird corner cases where an xor linked list is actually worthwhile?
Apart from saving memory, it allows for O(1) reversal, while still supporting all the other destructive update operations efficienctly, like
concating two lists destructively in O(1)
insertAfter/insertBefore in O(1), when you only have a reference to the node and its successor/predecessor (which differs slightly from standard doubly linked lists)
remove in O(1), also with a reference to either the successor or predecessor.
I don't think the memory aspect is really important, since for most scenarios where you might use a XOR list, you can use a singly-linked list instead.
It is about saving memory. I had a situation where my data structure was 40 bytes. The memory manager aligned things on a 16 byte boundary, so each allocation was 48 bytes; regardless of the fact that I only needed 40. By using xor chain list, I was able to eliminate 8 bytes and drop my data structure size down to 32 bytes. Now, I can fit 2 nodes in the 64 byte pipeline cache at the same time. So, I was able to reduce memory usage, and improve performance.
Its purpose is (or more precisely was) just to save memory.
With a xor-linked-list you can do anything you can do with a ordinary doubly-linked list. The only difference is that you have to decode the previous and next memory addresses from the xor-ed pointer for each node every time you need them.

Resources