Purpose of Xor Linked List? - performance

I stumbled on a Wikipedia article about the Xor linked list, and it's an interesting bit of trivia, but the article seems to imply that it occasionally gets used in real code. I'm curious what it's good for, since I don't see why it makes sense even in severely memory/cache constrained systems:
The main advantage of a regular doubly linked list is the ability to insert or delete elements from the middle of the list given only a pointer to the node to be deleted/inserted after. This can't be done with an xor-linked list.
If one wants O(1) prepending or O(1) insertion/removal while iterating then singly linked lists are good enough and avoid a lot of code complexity and caveats with respect to garbage collection, debugging tools, etc.
If one doesn't need O(1) prepending/insertion/removal then using an array is probably more efficient in both space and time. Even if one only needs efficient insertion/removal while iterating, an array can still be competitive, since those insertions/removals can be folded into the same pass over the data.
Given the above, what's the point? Are there any weird corner cases where an xor linked list is actually worthwhile?

Apart from saving memory, it allows for O(1) reversal while still supporting all the other destructive update operations efficiently, such as:
concatenating two lists destructively in O(1)
insertAfter/insertBefore in O(1), when you only have a reference to the node and its successor/predecessor (which differs slightly from standard doubly linked lists)
remove in O(1), also with a reference to either the successor or predecessor.
I don't think the memory aspect is really important, since for most scenarios where you might use a XOR list, you can use a singly-linked list instead.
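To make the O(1)-reversal point concrete, here is a minimal sketch of an XOR-linked node and its traversal in C++ (my own illustration, not code from the article): each node stores prev XOR next in a single field, and knowing which node you arrived from is enough to recover the other neighbour, so reversing the list just means starting the walk from the other end.
#include <cstdint>
#include <cstdio>
// Minimal XOR-linked-list node: one `link` field holds
// (address of prev) XOR (address of next) instead of two pointers.
struct Node
{
    int value;
    uintptr_t link;
};
static uintptr_t addr(const Node* p) { return reinterpret_cast<uintptr_t>(p); }
// Walk the list starting at `from`, having arrived from `came_from`
// (nullptr at either end). Because link == prev XOR next, XOR-ing the
// node we came from back out of `link` yields the next node.
// Reversing the list is O(1): just begin the walk at the other end.
void traverse(const Node* from, const Node* came_from = nullptr)
{
    const Node* prev = came_from;
    const Node* cur = from;
    while (cur != nullptr)
    {
        std::printf("%d ", cur->value);
        const Node* next = reinterpret_cast<const Node*>(addr(prev) ^ cur->link);
        prev = cur;
        cur = next;
    }
    std::printf("\n");
}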

It is about saving memory. I had a situation where my data structure was 40 bytes. The memory manager aligned allocations on a 16-byte boundary, so each allocation was 48 bytes regardless of the fact that I only needed 40. By using an XOR linked list, I was able to eliminate 8 bytes and drop my data structure size down to 32 bytes. Now I can fit two nodes in a 64-byte cache line at the same time. So I was able to reduce memory usage and improve performance.

Its purpose is (or, more precisely, was) just to save memory.
With an XOR linked list you can do anything you can do with an ordinary doubly linked list. The only difference is that you have to decode the previous and next memory addresses from the XOR-ed pointer of each node every time you need them.

Related

Stack overflow solution with O(n) runtime

I have a problem related to runtime for push and pop in a stack.
Here, I implemented a stack using an array.
I want to avoid overflow in the stack when I insert a new element to a full stack, so when the stack is full I do the following (Pseudo-Code):
(I consider a stack as an array)
Generate a new array with the size of double of the origin array.
Copy all the elements in the origin stack to the new array in the same order.
Now, I know that a single push onto a stack of size n executes in O(n) in the worst case.
I want to show that the runtime of n pushes onto an empty stack is also O(n) in the worst case.
Also, how can I modify this algorithm so that every push executes in constant time in the worst case?
Amortized constant time is often just as good in practice as, if not better than, worst-case constant-time alternatives.
Generate a new array with the size of double of the origin array.
Copy all the elements in the origin stack to the new array in the same order.
This is actually a very decent and respectable solution for a stack implementation because it has good locality of reference and the cost of reallocating and copying is amortized to the point of being almost negligible. Most generalized solutions to "growable arrays" like ArrayList in Java or std::vector in C++ rely on this type of solution, though they might not exactly double in size (lots of std::vector implementations increase their size by something closer to 1.5 than 2.0).
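To spell out why n pushes cost O(n) in total under doubling (assuming the capacity starts at 1 and the final capacity is C, with n <= C < 2n), the copying across all reallocations is a geometric series:
total copies = 1 + 2 + 4 + ... + C/2 = C - 1 < 2n
total work = n element writes + fewer than 2n copies = O(n)
So n pushes take O(n) overall, i.e. O(1) amortized per push, even though a single push can cost O(n).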
One of the reasons this is much better than it sounds is because our hardware is super fast at copying bits and bytes sequentially. After all, we often rely on millions of pixels being blitted many times a second in our daily software. That's a copying operation from one image to another (or frame buffer). If the data is contiguous and just sequentially processed, our hardware can do it very quickly.
Also, how can I modify this algorithm so that every push executes in constant time in the worst case?
I have come up with stack solutions in C++ that are ever-so-slightly faster than std::vector for pushing and popping a hundred million elements and meet your requirements, but only for pushing and popping in a LIFO pattern. We're talking about something like 0.22 secs for vector as opposed to 0.19 secs for my stack. That relies on just allocating blocks like this:
... of course typically with more than 5 elements' worth of data per block! (I just didn't want to draw an epic diagram.) Each block stores a contiguous array of data; when it fills up, it links to a new block. The blocks are chained (each storing only a previous link), and each one might hold, say, 512 bytes worth of data with 64-byte alignment. That allows constant-time pushes and pops with no need to reallocate or copy. When a block fills up, a new block is linked to the previous one and starts filling up. When you pop, you just pop until the block becomes empty; once it's empty, you follow its previous link to reach the block before it and start popping from that (you can also free the now-empty block at this point).
Here's your basic pseudo-C++ example of the data structure:
template <class T>
struct UnrolledNode
{
// Points to the previous block. We use this to get
// back to a former block when this one becomes empty.
UnrolledNode* prev;
// Stores the number of elements in the block. If
// this becomes full with, say, 256 elements, we
// allocate a new block and link it to this one.
// If this reaches zero, we deallocate this block
// and begin popping from the previous block.
size_t num;
// Array of the elements. This has a fixed capacity,
// say 256 elements, determined at runtime based on
// sizeof(T). It's a flexible array member (a
// variable-length struct), so the node header and
// its elements can be allocated together in one block.
T data[];
};
template <class T>
struct UnrolledStack
{
// Stores the tail end of the list (the last
// block we're going to be using to push to and
// pop from).
UnrolledNode<T>* tail;
};
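As a rough sketch (not the answerer's actual code; BLOCK_CAPACITY and the function names are my own), push and pop on top of those structs could look like this, with each operation doing at most one fixed-size allocation or free:
#include <cstddef>
#include <cstdlib>
#include <new>
static const std::size_t BLOCK_CAPACITY = 256;
template <class T>
void push(UnrolledStack<T>& s, const T& value)
{
    if (s.tail == nullptr || s.tail->num == BLOCK_CAPACITY)
    {
        // Allocate the header and its element array in one block
        // (this is what the flexible array member buys us).
        // Error handling omitted for brevity.
        UnrolledNode<T>* block = static_cast<UnrolledNode<T>*>(
            std::malloc(sizeof(UnrolledNode<T>) + BLOCK_CAPACITY * sizeof(T)));
        block->prev = s.tail;
        block->num = 0;
        s.tail = block;
    }
    new (&s.tail->data[s.tail->num++]) T(value); // placement-construct on top
}
template <class T>
void pop(UnrolledStack<T>& s)
{
    s.tail->data[--s.tail->num].~T();            // destroy the top element
    if (s.tail->num == 0)
    {
        UnrolledNode<T>* empty = s.tail;
        s.tail = empty->prev;                    // step back to the previous block
        std::free(empty);
    }
}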
That said, I actually recommend your solution instead for performance, since mine barely has an edge over the simple reallocate-and-copy approach, and yours has a slight edge for traversal since it can walk the array in a straightforward sequential fashion (as well as giving straightforward random access if you need it). I didn't actually implement mine for performance reasons; I implemented it to prevent pointers from being invalidated when you push things to the container (the actual thing is a memory allocator in C), and, again, in spite of achieving true constant-time push-backs and pop-backs, it's still barely any faster than the amortized constant-time solution involving reallocation and memory copying.

"cut and paste" the last k elements of std::vector efficiently?

Is it possible in C++11 to "cut and paste" the last k elements of an std::vector A to a new std::vector B in constant time?
One way would be to use B.insert(B.end(), A.end() - k, A.end()) and then call erase on A, but these are both O(k) operations.
No, vectors own their memory.
This operation is known as splice. forward_list is ridiculously slow otherwise, but it does have an O(1) splice.
Typically, the process of deciding which elements to move is already O(n), so having the splice itself take O(n) time is not a problem. The other operations being faster on vector more than make up for it.
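For what it's worth, here is roughly what splice looks like with a node-based container (my own illustration, not from the answers): the last k nodes of A are re-linked into B without copying the elements, though with C++11's std::list a range splice between two different lists is still O(k), because both lists must update their sizes.
#include <iostream>
#include <iterator>
#include <list>
int main()
{
    std::list<int> A = {1, 2, 3, 4, 5, 6};
    std::list<int> B;
    const int k = 2;
    auto first = std::prev(A.end(), k);    // iterator to the (n-k)-th element
    B.splice(B.end(), A, first, A.end());  // re-link [first, A.end()) into B
    for (int x : B) std::cout << x << ' '; // prints: 5 6
}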
This isn't possible in general, since (at least in the C++03 version -- there it's 23.2.4/1) the C++ standard guarantees that the memory used by a vector<T> is a single contiguous block. Thus the only way to "transfer" more than a fixed number of elements in O(1) time would be if the receiving vector were empty and you had somehow arranged for its allocated block of memory to begin at the right place inside the first vector -- in which case the "transfer" could be argued to have taken no time at all. (Deliberately overlapping objects in this way is almost certain to constitute undefined behaviour in theory -- and in practice it's also very fragile, since any operation that invalidates iterators to a vector<T> can also reallocate memory, thus breaking things.)
If you're prepared to sacrifice a whole bunch of portability, I've heard it's possible to play OS-level (or hardware-level) tricks with virtual memory mapping to achieve tricks like no-overhead ring buffers. Maybe these tricks could also be applied here -- but bear in mind that the assumption that the mapping of virtual to physical memory within a single process is one-to-one is very deeply ingrained, so you could be in for some surprises.
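For the curious, the "no-overhead ring buffer" trick mentioned above usually means mapping the same physical pages twice at adjacent virtual addresses so that wraparound disappears. Here is a Linux-only sketch (memfd_create plus mmap; the function name make_mirrored_buffer is mine, and whether any of this can be bent to serve the vector-splicing case is exactly the open question above):
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
// Returns a pointer to `size` bytes that are also visible at offset +size,
// so p[i] and p[i + size] alias the same byte. `size` must be a multiple of
// the page size. Returns nullptr on failure.
// (memfd_create needs Linux and glibc 2.27+, and may require _GNU_SOURCE.)
char* make_mirrored_buffer(std::size_t size)
{
    int fd = memfd_create("mirror", 0);                    // anonymous shared pages
    if (fd < 0) return nullptr;
    if (ftruncate(fd, static_cast<off_t>(size)) != 0) { close(fd); return nullptr; }
    // Reserve a contiguous 2*size region of address space.
    void* base = mmap(nullptr, 2 * size, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) { close(fd); return nullptr; }
    // Map the same file pages into both halves of the reservation.
    char* p = static_cast<char*>(base);
    mmap(p,        size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);
    mmap(p + size, size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);
    close(fd);
    return p;
}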

Efficiency of appending to vectors

Appending an element onto a million-element ArrayList has the cost of setting one reference now, and copying one reference in the future when the ArrayList must be resized.
As I understand it, appending an element onto a million-element PersistentVector must create a new path, which consists of 4 arrays of size 32, which means more than 120 references have to be touched.
How does Clojure manage to keep the vector overhead to "2.5 times worse" or "4 times worse" (as opposed to "60 times worse"), as has been claimed in several Clojure videos I have seen recently? Does it have something to do with caching or locality of reference or something I am not aware of?
Or is it somehow possible to build a vector internally with mutation and then turn it immutable before revealing it to the outside world?
I have tagged the question scala as well, since scala.collection.immutable.Vector is basically the same thing, right?
Clojure's PersistentVectors have a special tail buffer to enable efficient operation at the end of the vector. Only after this 32-element array is filled is it added to the rest of the tree. This keeps the amortized cost low. Here is one article on the implementation. The source is also worth a read.
Regarding "is it somehow possible to build a vector internally with mutation and then turn it immutable before revealing it to the outside world?" -- yes! These are known as transients in Clojure, and they are used for efficient batch changes.
I can't speak to Clojure, but I can give some comments about Scala Vectors.
Persistent Scala vectors (scala.collection.immutable.Vector) are much slower than an array buffer when it comes to appending. In fact, they are 10x slower than the List prepend operation. They are 2x slower than appending to Conc-trees, which we use in Parallel Collections.
But Scala also has mutable vectors -- they're hidden in the class VectorBuilder. Appending to a mutable vector does not preserve the previous version of the vector; it mutates it in place by keeping a pointer to the rightmost leaf in the vector. So, yes -- keeping the vector mutable internally and then returning an immutable reference is exactly what's done in Scala collections.
The VectorBuilder is slightly faster than the ArrayBuffer, because it needs to allocate its arrays only once, whereas ArrayBuffer needs to do it twice on average (because of growing). Conc.Buffers, which we use as parallel array combiners, are twice as fast compared to VectorBuilders.
Benchmarks are here. None of the benchmarks involve any boxing; they work with reference objects to avoid bias:
comparison of Scala List, Vector and Conc
comparison of Scala ArrayBuffer, VectorBuilder and Conc.Buffer
More collections benchmarks here.
These tests were executed using ScalaMeter.

Pseudo Least Recently Used Binary Tree

The logic behind pseudo-LRU is to use fewer bits and to speed up the replacement of the block. The logic is given as: "let 1 represent that the left side has been referenced more recently than the right side, and 0 vice-versa".
But I am unable to understand the implementation given in the following diagram:
Details are given at : http://courses.cse.tamu.edu/ejkim/614/CSCE614-2011c-HW4-tutorial.pptx
I'm also studying pseudo-LRU.
Here is my understanding; I hope it's helpful.
"Hit CL1": there's a reference to CL1, and it hits.
The LRU state bits (B0 and B1) are changed to record that CL1 was referenced most recently.
"Hit CL0": there's a reference to CL0, and it hits.
The LRU state bit (B1) is updated to record that CL0 was used more recently (than CL1).
"Miss; CL2 replaced":
There's a miss, and the LRU logic is asked for a replacement index.
Given the current state, CL2 is chosen.
The LRU state bits (B0 and B2) are updated to record that CL2 was used most recently.
(This also means the next replacement victim will be CL1.)
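To tie the walkthrough to code, here is a small sketch of tree pseudo-LRU for a single 4-way set, using the convention from the question (bit = 1 means the left side was referenced more recently). The struct and member names are mine, not from the slides:
#include <cstdio>
// Tree pseudo-LRU for one 4-way set. B0 is the root (left half = CL0/CL1
// vs right half = CL2/CL3), B1 covers CL0 vs CL1, B2 covers CL2 vs CL3.
struct PseudoLRU4
{
    bool b0 = false, b1 = false, b2 = false;
    // Update the bits on every access (hit or fill) to way w (0..3).
    void touch(int w)
    {
        if (w < 2) {          // CL0 or CL1: left half is now more recent
            b0 = true;
            b1 = (w == 0);    // 1 means CL0 more recent than CL1
        } else {              // CL2 or CL3: right half is now more recent
            b0 = false;
            b2 = (w == 2);    // 1 means CL2 more recent than CL3
        }
    }
    // On a miss, walk away from the recently used side to pick a victim.
    int victim() const
    {
        if (b0) return b2 ? 3 : 2;  // left half recent -> evict from right
        else    return b1 ? 1 : 0;  // right half recent -> evict from left
    }
};
int main()
{
    PseudoLRU4 lru;
    lru.touch(1);                      // "Hit CL1": B0 and B1 change
    lru.touch(0);                      // "Hit CL0": B1 changes
    std::printf("%d\n", lru.victim()); // miss -> prints 2 (CL2 replaced)
    lru.touch(2);                      // B0 and B2 updated; next victim is CL1
}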
I know there is already an answer which clearly explains the picture, but I wanted to post my way of thinking about implementing a fast pseudo-LRU algorithm and its advantages over classic LRU.
From the memory point of view, if there are N objects (pointers, 32/64-bit values) you need N-1 flag bits plus a HashMap storing information about the objects (pointer to the actual address and position in the array) so you can query whether an element already exists in the cache. It doesn't use less memory than classic LRU; it actually uses N-1 auxiliary bits.
The optimization comes from CPU time. Comparing a handful of flags takes essentially no time because they are single bits. In classic LRU you need some structure that supports insertion/deletion and lets you find the LRU element quickly (maybe a heap). Such a structure takes O(log N) per typical operation, and the comparisons between values are also expensive, so you end up with roughly O(log(N)^2) per operation, instead of O(log N) for pseudo-LRU.
Even if pseudo-LRU doesn't always evict the true LRU object on a cache miss, in practice it behaves pretty well, and this is not a major drawback.

Why does this AVL tree implementation pack bits into pointers in 64-bit but not 32-bit implementations?

In this AVL tree implementation from Solaris, struct avl_node is defined in an obvious way if compiling for a 32-bit library.
But for the 64-bit library, a pointer to the node's parent is packed into "avl_pcb", and it looks like only 61 bits of the pointer are stored.
Why does this work?
Why not do a similar thing for 32-bit?
On a 64-bit machine, pointers are usually aligned to word boundaries, i.e. to multiples of eight bytes. As a result, the lowest three bits of a pointer will be zero. Consequently, if a data structure needs three bits of information, it can pack them into the lowest three bits of a pointer (see the sketch after this list). That way:
To follow the pointer, clear the lowest three bits of the pointer value, then dereference it.
To read any of the three bits, mask off the rest of the pointer and read them directly.
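Here's a small, generic illustration of that trick in C++ (my own sketch; TaggedParent is not the actual Solaris avl_node layout):
#include <cassert>
#include <cstdint>
struct Node;  // whatever the tree node type is
// A parent pointer with three flag bits packed into its low bits.
// Works because 8-byte-aligned node addresses always end in binary 000.
struct TaggedParent
{
    uintptr_t bits = 0;
    void set(Node* parent, unsigned flags)
    {
        assert((reinterpret_cast<uintptr_t>(parent) & 0x7u) == 0); // aligned
        assert(flags <= 0x7u);                                     // 3 bits max
        bits = reinterpret_cast<uintptr_t>(parent) | flags;
    }
    // To follow the pointer: clear the lowest three bits, then dereference.
    Node* parent() const { return reinterpret_cast<Node*>(bits & ~uintptr_t{7}); }
    // To read the flags: mask off everything but the lowest three bits.
    unsigned flags() const { return static_cast<unsigned>(bits & 0x7u); }
};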
This approach is pretty standard and doesn't lose any ability to point to addresses, since usually for performance or hardware reasons you wouldn't want to have non-aligned pointers anyway.
What I'm not sure about is why they didn't do this in the 32-bit case, since with three pointers they could easily hide the necessary bits using the same trick, but with two bits per pointer. My guess is that it's a performance thing: if you pack bits into the bottom of a pointer, you increase the cost of following the pointer because of the extra computation needed to clear the bits. Note, for instance, that in the 64-bit case the bits are packed into the parent pointer, which is only used for uncommon operations like computing inorder successors or doing rotations on an insert or delete. This keeps lookups fast. In the 32-bit case, to hide 3 bits the implementation would need to use the lower bits of two pointers, one of which would have to be the left or right pointer. My guess is that the performance hit of slowing down tree searches wasn't worth the space savings, so they decided to just take the memory hit and store the bits separately. This is just speculation, though, since they absolutely could have stored the bits in the bottoms of their pointers if they wanted to.
On a side note: if the implementation was using a red/black tree rather than an AVL tree, then only two bits of information would be necessary: a bit to tell if the node is red or black, and a bit to tell whether the node is a left or right child. In that case, the two bits required could always be packed into a 32-bit pointer. This is one reason why red-black trees are popular.
Hope this helps!

Resources