Realistic usage of unrolled skip lists - data-structures

Why is there no information on Google / Wikipedia about unrolled skip lists, i.e. a combination of an unrolled linked list and a skip list?

Probably because it wouldn't typically give you much of a performance improvement, if any, and it would be somewhat involved to code correctly.
First, the unrolled linked list typically uses a pretty small node size. As the Wikipedia article says: "just large enough so that the node fills a single cache line or a small multiple thereof." On modern Intel processors, a cache line is 64 bytes. Skip list nodes have, on average, two pointers per node, which means an average of 16 bytes per node for the forward pointers, plus whatever the data for the node is: 4 or 8 bytes for a scalar value, or 8 bytes for a reference (I'm assuming a 64-bit machine here).
So figure 24 bytes, total, for an "element." Except that the elements aren't fixed size. They have a varying number of forward pointers. So you either need to make each element a fixed size by allocating an array for the maximum number of forward pointers for each element (which for a skip list with 32 levels would require 256 bytes), or use a dynamically allocated array that's the correct size. So your element becomes, in essence:
struct UnrolledSkipListElement
{
    void* data;                                 // pointer to the data item (8 bytes on a 64-bit machine)
    UnrolledSkipListElement** forward_pointers; // dynamically allocated array of forward pointers
};
That would reduce your element size to just 16 bytes. But then you lose much of the cache-friendly behavior that you got from unrolling. To find out where you go next, you have to dereference the forward_pointers array, which is going to incur a cache miss, and therefore eliminate the savings you got by doing the unrolling. In addition, that dynamically allocated array of pointers isn't free: there's some (small) overhead involved in allocating that memory.
If you can find some way around that problem, you're still not going to gain much. A big reason for unrolling a linked list is that you must visit every node (up to the node you find) when you're searching it. So any time you can save with each link traversal adds up to very big savings. But with a skip list you make large jumps. In a perfectly organized skip list, for example, you could skip half the nodes on the first jump (if the node you're looking for is in the second half of the list). If your nodes in the unrolled skip list only contain four elements, then the only savings you gain will be at levels 0, 1, and 2. At higher levels you're skipping more than three nodes ahead and as a result you will incur a cache miss.
So the skip list isn't unrolled because it would be somewhat involved to implement and it wouldn't give you much of a performance boost, if any. And it might very well cause the list to be slower.

Linked list complexity is O(N)
Skip list complexity is O(Log N)
The unrolled linked list complexity can be calculated as follows:
O(N / (M/2) + Log M) = O(2N/M + Log M)
where M is the number of elements in a single node.
Because Log M is not significant, the unrolled linked list complexity is O(N/M).
If we combine the skip list with the unrolled linked list, the new complexity will be
O(Log N + "something from the unrolled linked list, such as N1/M")
This means the "new" complexity will not be as good as one might first think; it might even be worse than the original O(Log N). The implementation will be more complex as well, so the gain is questionable and rather dubious.
Also, since a single node will hold a lot of data but only a single "forward" array, the "tree" will not be as well balanced, and this will ruin the O(Log N) part of the equation.
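To put rough numbers on this (my own back-of-the-envelope figures, not from the answer above), take N = 1,000,000 elements and M = 64 elements per node:

skip list:               Log N            ≈ 20 node hops
unrolled linked list:    2N/M + Log M     ≈ 31,250 + 6 steps
combined structure:      Log(N/M) + M/2   ≈ 14 + 32 steps (linear scan inside a node)
                         Log(N/M) + Log M ≈ 14 + 6 steps  (binary search inside a node)

Even under the most favorable assumption the combination only ties the plain skip list, and with a linear scan inside each node it is clearly worse, which is the point being made above.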

Related

Merging two sorted linked lists - understanding why it's O(1) vs. O(N) space complexity

The majority of implementations I've seen for merging two sorted linked lists iteratively are as follows.
Create a dummy node. Point it to the linked list head that has the smaller value. Move that head to its next node. Move dummy pointer to its next node. Repeat.
I don't understand why this procedure has space complexity of O(1) and not O(N)? While we are pointing the dummy node to existing nodes in two linked lists, we're essentially creating a new linked list-- one that interweaves the two existing lists. Consequently, doesn't this still require O(N) space? The dummy node is the head of its own linked list that is separate from the original two linked lists, even though it uses the same nodes...
You are absolutely right that you are going to need Θ(n) storage space to hold the result of merging two lists of total length n. But how much of that storage space was already there before the function started running, and how much of that storage space is new? You already had two lists of n total elements, so you already were using Θ(n) space before you started this algorithm, and when you're done you have the same lists lying around, just rewired so that the next pointers might be pointing to different places. As a result, the amount of memory you needed to allocate for this procedure is not Θ(n), but rather Θ(1).
More generally, it's common when measuring space complexity to ignore the space used by the inputs to the problem, because in some sense that space cost is unavoidable and there's nothing you can do to eliminate it.
One piece of advice going forward: if you write something like O(1) or O(n), it's often a good idea to make clear whether you're measuring time or space. For example, it's clearer to say that the procedure needs O(n) memory or O(1) time rather than to say that the procedure "is" O(n) or "is" O(1), since it's unclear what you're measuring with the big-O notation when you do that.
To follow up on templatetypedef's answer: since the same nodes are being used for both the input and the output, there's no additional space used there.
A dummy node is only needed in languages that don't support a pointer to a pointer, or to simplify the code. In the case of C / C++ you could use something like this:
NODE *merge(NODE *pSrc1, NODE *pSrc2){
    NODE *pDst = NULL;              // ptr to destination (merged) list
    NODE **ppDst = &pDst;           // ptr to pDst or to some node's next pointer
    // ...
    *ppDst = ...;                   // add a node to the destination list
    // ...
    ppDst = &((*ppDst)->next);      // advance ppDst
    // ...
    return pDst;                    // return ptr to merged list
}
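For completeness, here is one way the elided parts might be filled in. This is my own completion (the original answer left them as "..."), assuming a node that simply holds an int key:

typedef struct NODE NODE;
struct NODE { int data; NODE *next; };

NODE *merge(NODE *pSrc1, NODE *pSrc2){
    NODE *pDst = NULL;              // merged list
    NODE **ppDst = &pDst;           // points at pDst, then at the tail's next pointer
    while (pSrc1 != NULL && pSrc2 != NULL) {
        if (pSrc2->data < pSrc1->data) {
            *ppDst = pSrc2;         // append the node from the second list
            pSrc2 = pSrc2->next;
        } else {
            *ppDst = pSrc1;         // append the node from the first list
            pSrc1 = pSrc1->next;
        }
        ppDst = &((*ppDst)->next);  // advance ppDst to the new tail's next pointer
    }
    *ppDst = (pSrc1 != NULL) ? pSrc1 : pSrc2;  // append whatever remains
    return pDst;
}

No nodes are allocated; the existing nodes are simply relinked, which is exactly why the extra space is O(1).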

Running maximum of changing array of fixed size

At first, I am given an array of fixed size, call it v. The typical size of v would be a few thousand entries. I start by computing the maximum of that array.
Following that, I am periodically given a new value for v[i] and need to recompute the value of the maximum.
What is a practically fast way (average time) of computing that maximum?
Edit: we can assume that the process is:
1) uniformly choosing a random entry;
2) changing its value to a uniform value between [0,1].
I believe this specifies the problem a bit better and allows an unequivocal "best answer" (which will depend on the array size).
You can maintain a max-heap of that array. The elements of the heap can be indices into the array, and for every element of the array you also keep its position in the max-heap. That way, every time v[i] is changed, you only need O(log(n)) to restore the heap (if v[i] increased, its entry goes up in the heap; if v[i] decreased, it goes down).
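A minimal sketch of that idea, assuming the values are doubles (my code and names, not the answerer's): a max-heap of indices into the array plus a reverse map, so a change to v[i] can be re-sifted in O(log n).

#include <cstddef>
#include <utility>
#include <vector>

struct IndexedMaxHeap {
    std::vector<double> v;          // the underlying array
    std::vector<std::size_t> heap;  // heap of indices into v, max at heap[0]
    std::vector<std::size_t> pos;   // pos[i] = position of index i inside heap

    explicit IndexedMaxHeap(std::vector<double> a) : v(std::move(a)) {
        std::size_t n = v.size();
        heap.resize(n);
        pos.resize(n);
        for (std::size_t i = 0; i < n; ++i) { heap[i] = i; pos[i] = i; }
        for (std::size_t i = n / 2; i-- > 0; ) siftDown(i);   // O(n) build
    }

    double max() const { return v[heap[0]]; }

    // Change v[i] to x and restore the heap in O(log n).
    void update(std::size_t i, double x) {
        double old = v[i];
        v[i] = x;
        if (x > old) siftUp(pos[i]); else siftDown(pos[i]);
    }

private:
    void swapNodes(std::size_t a, std::size_t b) {
        std::swap(heap[a], heap[b]);
        pos[heap[a]] = a;
        pos[heap[b]] = b;
    }
    void siftUp(std::size_t h) {
        while (h > 0) {
            std::size_t parent = (h - 1) / 2;
            if (v[heap[h]] <= v[heap[parent]]) break;
            swapNodes(h, parent);
            h = parent;
        }
    }
    void siftDown(std::size_t h) {
        std::size_t n = heap.size();
        for (;;) {
            std::size_t l = 2 * h + 1, r = l + 1, biggest = h;
            if (l < n && v[heap[l]] > v[heap[biggest]]) biggest = l;
            if (r < n && v[heap[r]] > v[heap[biggest]]) biggest = r;
            if (biggest == h) break;
            swapNodes(h, biggest);
            h = biggest;
        }
    }
};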
If the changes to the array are random, e.g. v[rand()%size] = rand(), then most of the time the max won't decrease.
There are two main ways I can think of to handle this: keep the full collection sorted on the fly, or track just the few (or one) highest elements. The choice depends on the relative importance of worst-case, average case, and fast-path. (Including code and data cache footprint of the common case where the change doesn't affect anything you're tracking.)
Really low complexity / overhead / code size: O(1) average case, O(N) worst case.
Just track the current max (and optionally its position, if you can't get at the old value to check whether it == max before applying the change). On the rare occasion that the element holding the max decreases, rescan the whole array. Otherwise just check whether the new element is greater than max.
The average complexity should be O(1) amortized: O(N) total work for N changes, since on average only one in N changes affects the element holding the max (and only half of those changes decrease it).
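A tiny sketch of this first option (my code; it assumes double values and that we can afford an O(N) std::max_element rescan on the rare path):

#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

struct RunningMax {
    std::vector<double> v;
    std::size_t maxPos;   // position of the current maximum

    explicit RunningMax(std::vector<double> a) : v(std::move(a)) { rescan(); }

    double max() const { return v[maxPos]; }

    void update(std::size_t i, double x) {
        bool heldMax = (i == maxPos);
        double oldVal = v[i];
        v[i] = x;
        if (heldMax && x < oldVal) {
            rescan();                 // element holding the max decreased: rare O(N) path
        } else if (x > v[maxPos]) {
            maxPos = i;               // new value beats the current max: O(1)
        }
        // Otherwise the max is unaffected: O(1) fast path.
    }

private:
    void rescan() {
        maxPos = static_cast<std::size_t>(
            std::max_element(v.begin(), v.end()) - v.begin());
    }
};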
A bit more overhead and code size, but less frequent scans of the full array: O(1) typical case, O(N) worst case.
Keep a priority queue of the 4 or 8 highest elements in the array (position and value). When an element in the PQueue is modified, remove it from the PQueue. Try to re-add the new value to the PQueue, but only if it won't be the smallest element. (It might be smaller than some other element we're not tracking). If the PQueue is empty, rescan the array to rebuild it to full size. The current max is the front of the PQueue. Rescanning the array should be quite rare, and in most cases we only have to touch about one cache line of data holding our PQueue.
Since the small PQueue needs to support fast access to the smallest and the largest element, and even finding elements that aren't the min or max, a sorted-array implementation probably makes more sense than a heap. If it's only 8 elements, a linear search is probably best, too (searching upward from the smallest element, so the search stops right away if the old value of the modified element is less than the smallest value in the PQueue).
If you want to optimize the fast-path (position modified wasn't in the PQueue), you could store the PQueue as struct pqueue { unsigned pos[8]; int val[8]; }, and use vector instructions (e.g. x86 SSE/AVX2) to test i against all 8 positions in one or two tests. Hrm, actually just checking the old val to see if it's less than PQ.val[0] should be a good fast-path.
To track the current size of the PQueue, it's probably best to use a separate counter, rather than a sentinel value in pos[]. Checking for the sentinel every loop iteration is probably slower. (esp. since you'd prob. need to use pos to hold the sentinel values; maybe make it signed after all and use -1?) If there was a sentinel you could use in val[], that might be ok.
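A rough sketch of that small sorted-array PQueue (my code and names; K = 8, no SIMD fast path, and the rare case where the queue empties triggers the full O(N) rescan described above):

#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

struct TopFewMax {
    static const std::size_t K = 8;
    std::vector<double> v;
    std::vector<std::pair<double, std::size_t>> top;  // (value, position), ascending by value

    explicit TopFewMax(std::vector<double> a) : v(std::move(a)) { rebuild(); }

    double max() const { return top.back().first; }

    void update(std::size_t i, double x) {
        // If position i was being tracked, drop its stale entry.
        for (std::size_t j = 0; j < top.size(); ++j)
            if (top[j].second == i) { top.erase(top.begin() + j); break; }
        v[i] = x;
        // Re-add only if it beats the smallest tracked value; a smaller value
        // could still be below elements we are not tracking.
        if (!top.empty() && x > top.front().first) {
            std::pair<double, std::size_t> e(x, i);
            top.insert(std::upper_bound(top.begin(), top.end(), e), e);
            if (top.size() > K) top.erase(top.begin());  // keep only the K largest
        }
        if (top.empty()) rebuild();   // rare full O(N) rescan
    }

private:
    void rebuild() {
        top.clear();
        for (std::size_t i = 0; i < v.size(); ++i) {
            std::pair<double, std::size_t> e(v[i], i);
            if (top.size() < K) {
                top.insert(std::upper_bound(top.begin(), top.end(), e), e);
            } else if (e > top.front()) {
                top.erase(top.begin());
                top.insert(std::upper_bound(top.begin(), top.end(), e), e);
            }
        }
    }
};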
Slower O(log N) average case, but no full-rescan worst case:
Xiaotian Pei's solution of making the whole array a heap. (This doesn't work if the ordering of v[] matters. You could keep all the elements in a Heap as well as in the ordered array, but that sounds cumbersome.) Re-heapifying after changing a random element will probably write several other cache lines every time, so the common case is much slower than for the methods that only track the top one or few elements.
something else clever I haven't thought of?

Designing a data structure acts like improved stack

I have been asked to design a data structure which will act like a stack, not limited in size, which will support the following methods, with given run-time restrictions.
push(s) - push s to the data structure - O(1)
pop() - remove and return the last element inserted O(1)
middle() - return the element (without removing) with index n/2 by insertion order where n is the current amount of elements in the data structure. - O(1)
peekAt(k) - return the kth element by insertion order (the bottom of the stack is k=1) - O(log(k))
I thought of using a linked list and always keeping a pointer to the middle element, but then I had a problem implementing peekAt(k). Any ideas how I can implement this?
If the O(1) restriction can be relaxed to amortized O(1), a typical variable-length array implementation will do. When you allocate space for the array of current length N, reserve say N extra space at the end. Once you grow beyond this border, reallocate with the new size following the same strategy, copy the old contents there and free the old memory. Of course, you will have to maintain both the allocated and the actual length of your stack. The operations middle and peekAt can be done trivially in O(1).
Conversely, you may also shrink the array if it occupies less than 1/4 of the allocated space if the need arises.
All operations will be amortized O(1). The precise meaning of this is that for any K stack operations since the start, you will have to execute O(K) instructions in total. In particular, the number of reallocations after N pushes will be O(log(N)), and the total amount of elements copied due to reallocation will be no more than 1 + 2 + 4 + 8 ... + N <= 2N = O(N).
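As a minimal illustration of this approach (my sketch, in C++; std::vector already implements the doubling-and-reserve strategy described above):

#include <cstddef>
#include <stdexcept>
#include <vector>

template <typename T>
class IndexableStack {
    std::vector<T> a;   // elements in insertion order, bottom of the stack first
public:
    void push(const T& s) { a.push_back(s); }            // amortized O(1)
    T pop() { T x = a.back(); a.pop_back(); return x; }  // O(1)

    // peekAt(k): k-th element by insertion order, k = 1 is the bottom.
    // O(1), which is even better than the required O(log k).
    const T& peekAt(std::size_t k) const {
        if (k == 0 || k > a.size()) throw std::out_of_range("peekAt");
        return a[k - 1];
    }

    // middle(): "the element with index n/2 by insertion order" - O(1).
    // (The exact rounding convention for odd n is an interpretation.)
    const T& middle() const { return a[a.size() / 2]; }

    std::size_t size() const { return a.size(); }
};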
This can be done asymptotically better, requiring non-amortized O(1) for each operation, provided that the memory manager's allocate and free perform in O(1) for any size. The basic idea is to maintain the currently allocated stack and the 2x bigger future stack, and to start preparing the bigger copy in advance. Each time you push a value onto the present stack, copy two more elements into the future stack. When the present stack is full, all of its elements will be already copied into the future stack. After that, discard the present stack, declare that the future stack is now the present stack, and allocate a new future stack (currently empty, but allocated 2x bigger than the current one).
If you also need shrinking, you can maintain a smaller copy in a similar fashion when your stack occupies between 1/2 and 1/4 of the allocated space.
As you can see by the description, while this may be theoretically better, it is generally slower since it has to maintain two copies of the stack instead of one. However, this approach can be useful if you have a strict realtime O(1) requirement for each operation.
The implementation using a doubly linked list makes sense to me. Push and Pop would be implemented as they usually are for a stack; the access to the 'middle' element would be done with an additional reference that is updated on Push and Pop, depending on whether the number of contained elements changes from even to odd or vice versa. The peekAt operation could be done using binary search.

How fast is Data.Sequence.Seq compared to []?

Clearly Seq asymptotically performs the same as or better than [] for all possible operations. But since its structure is more complicated than that of lists, for small sizes its constant overhead will probably make it slower. I'd like to know how much, in particular:
How much slower is <| compared to :?
How much slower is folding over/traversing Seq compared to folding over/traversing [] (excluding the cost of a folding/traversing function)?
What is the size (approximately) for which \xs x -> xs ++ [x] becomes slower than |>?
What is the size (approximately) for which ++ becomes slower than ><?
What's the cost of calling viewl and pattern matching on the result compared to pattern matching on a list?
How much memory does an n-element Seq occupy compared to an n-element list? (Not counting the memory occupied by the elements, only the structure.)
I know that it's difficult to measure, since with Seq we talk about amortized complexity, but I'd like to know at least some rough numbers.
This should be a start - http://www.haskell.org/haskellwiki/Performance#Data.Sequence_vs._lists
A sequence uses between 5/6 and 4/3 times as much space as the equivalent list (assuming an overhead of one word per node, as in GHC). If only deque operations are used, the space usage will be near the lower end of the range, because all internal nodes will be ternary. Heavy use of split and append will result in sequences using approximately the same space as lists. In detail:
a list of length n consists of n cons nodes, each occupying 3 words.
a sequence of length n has approximately n/(k-1) nodes, where k is the average arity of the internal nodes (each 2 or 3). There is a pointer, a size and overhead for each node, plus a pointer for each element, i.e. n(3/(k-1) + 1) words.
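For concreteness, plugging n = 1000 into the formulas above (my arithmetic):

list:                     3n             = 3000 words
sequence, all ternary:    n(3/2 + 1)     = 2500 words  (5/6 of the list)
sequence, all binary:     n(3/1 + 1)     = 4000 words  (4/3 of the list)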
List is a non-trivial constant-factor faster for operations at the head (cons and head), making it a more efficient choice for stack-like and stream-like access patterns. Data.Sequence is faster for every other access pattern, such as queue and random access.
I have one more concrete result to add to the above answer. I am solving a Langevin equation, using both List and Data.Sequence, with a lot of insertions at the back of the list/sequence.
To sum up, I did not see any improvement in speed; in fact, performance deteriorated with Sequence. Moreover, with Data.Sequence I needed to increase the memory available to the Haskell RTS.
Since I am definitely not an authority on optimizing, I post both cases below. I'd be glad to know if this can be improved. Both programs were compiled with the -O2 flag.
Solution with List, takes approx 13.01 sec
Solution with Data.Sequence, takes approx 15.13 sec

Quicksort - which sub-part should be sorted first?

I am reading some text which claims this regarding the ordering of the two recursive Quicksort calls:
... it is important to call the smaller subproblem first, this in conjunction with tail recursion ensures that the stack depth is log n.
I am not at all sure what that means. Why should I call Quicksort on the smaller subarray first?
Look at quicksort as an implicit binary tree. The pivot is the root, and the left and right subtrees are the partitions you create.
Now consider doing a depth first search of this tree. The recursive calls actually correspond to doing a depth first search on the implicit tree described above. Also assume that the tree always has the smaller sub-tree as the left child, so the suggestion is in fact to do a pre-order on this tree.
Now suppose you implement the preorder using a stack, where you push only the left child (but keep the parent on the stack) and when the time comes to push the right child (say you maintained a state where you knew whether a node has its left child explored or not), you replace the top of stack, instead of pushing the right child (this corresponds to the tail recursion part).
The maximum stack depth is the maximum 'left depth': i.e. if you mark each edge going to a left child as 1, and going to a right child as 0, then you are looking at the path with maximum sum of edges (basically you don't count the right edges).
Now since the left sub-tree has no more than half the elements, each time you go left (i.e. traverse an edge marked 1), you are reducing the number of nodes left to explore by at least half.
Thus the maximum number of edges marked 1 that you see, is no more than log n.
Thus the stack usage is no more than log n, if you always pick the smaller partition, and use tail recursion.
Some languages have tail recursion. This means that if you write f(x) { ...; g(x); } then the final call, to g(x), isn't implemented with a function call at all, but with a jump, so that the final call does not use any stack space.
Quicksort splits the data to be sorted into two sections. If you always handle the shorter section first, then each call that consumes stack space has a section of data to sort that is at most half the size of the recursive call that called it. So if you start off with 10 elements to sort, the stack at its deepest will have a call sorting those 10 elements, and then a call sorting at most 5 elements, and then a call sorting at most 2 elements, and then a call sorting at most 1 element - and then, for 10 elements, the stack cannot go any deeper - the stack size is limited by the log of the data size.
If you didn't worry about this, you could end up with the stack holding a call sorting 10 elements, and then a call sorting 9 elements, and then a call sorting 8 elements, and so on, so that the stack was as deep as the number of elements to be sorted. But this can't happen with tail recursion if you sort the short sections first, because although you can split 10 elements into 1 element and 9 elements, the call sorting 9 elements is done last of all and implemented as a jump, which doesn't use any more stack space - it reuses the stack space previously used by its caller, which was just about to return anyway.
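Here is a small sketch of what that looks like in code (mine, not from the answer; it uses a simple last-element Lomuto partition just to keep it short). The recursive call always handles the smaller partition, and the while loop plays the role of the tail call on the larger one, so no tail-call optimization is even needed:

#include <algorithm>
#include <cstddef>
#include <utility>

void quicksort(int *a, std::size_t n)
{
    while (n > 1) {
        // Partition around a[n-1]; p ends up at the pivot's final position.
        std::size_t p = 0;
        for (std::size_t i = 0; i + 1 < n; ++i)
            if (a[i] < a[n - 1]) std::swap(a[i], a[p++]);
        std::swap(a[p], a[n - 1]);

        std::size_t left = p, right = n - p - 1;  // sizes of the two partitions
        if (left <= right) {
            quicksort(a, left);           // recurse into the smaller (left) partition
            a += p + 1;                   // loop on the larger (right) one:
            n = right;                    // this is the "tail call"
        } else {
            quicksort(a + p + 1, right);  // recurse into the smaller (right) partition
            n = left;                     // loop on the larger (left) one
        }
    }
}

The recursed partition is never more than half the remaining range, so the explicit call stack can never be deeper than about log2 n frames.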
Ideally, the list partitions into two sublists of roughly similar size. It doesn't matter much which sublist you work on first.
But on a bad day the list partitions in the most lopsided way possible: a sublist of two or three items, maybe four, and a sublist nearly as long as the original. This could be due to bad choices of partition value or wickedly contrived data. Imagine what would happen if you worked on the bigger sublist first. The first invocation of Quicksort holds the pointers/indices for the short list in its stack frame while recursively calling quicksort for the long list. This too partitions badly into a very short list and a long one, and we do the longer sublist first, repeat...
Ultimately, on the baddest of bad days with the wickedest of wicked data, we'll have stack frames built up in number proportional to the original list length. This is quicksort's worst case behavior, O(n) depth of recursive calls. (Note we are talking of quicksort's depth of recursion, not performance.)
Doing the shorter sublist first gets rid of it fairly quickly. We still process a large number of tiny lists, in proportion to the original list length, but now each one is taken care of by one or two shallow recursive calls. We still make O(n) calls (performance) but each is at depth O(1).
Surprisingly, this turns out to be important even when quicksort is not confronted with wildly unbalanced partitions, and even when introsort is actually being used.
The problem arises (in C++) when the values in the container being sorted are really big. By this, I don't mean that they point to really big objects, but that they are themselves really big. In that case, some (possibly many) compilers will make the recursive stack frame quite big, too, because it needs at least one temporary value in order to do a swap. Swap is called inside of partition, which is not itself recursive, so you would think that the quicksort recursive driver would not require the monster stack-frame; unfortunately, partition usually ends up being inlined because it's nice and short, and not called from anywhere else.
Normally the difference between 20 and 40 stack frames is negligible, but if the values weigh in at, say, 8kb, then the difference between 20 and 40 stack frames could mean the difference between working and stack overflow, if stacks have been reduced in size to allow for many threads.
If you use the "always recurse into the smaller partition" algorithm, the stack cannot ever exceed log2 N frames, where N is the number of elements in the vector. Furthermore, N cannot exceed the amount of memory available divided by the size of an element. So on a 32-bit machine, there could only be 2^19 8 KB elements in a vector, and the quicksort call depth could not exceed 19.
In short, writing quicksort correctly makes its stack usage predictable (as long as you can predict the size of a stack frame). Not bothering with the optimization (to save a single comparison!) can easily cause the stack depth to double even in non-pathological cases, and in pathological cases it can get a lot worse.
