Don't skip lists have the same limitation as arrays? - algorithm

Would I be right in saying, Skip Lists like arrays can be very efficient if one knows the 'n' (number of elements to be stored) beforehand?
Maxlevel of a skip list is (log n + 1), and since I need to know the maxlevel before creating the skip list it means I should have an idea of what could be the number of elements that are gonna be stored.

You don't need to know the max level of a skiplist before you create it, as long as you are prepared to dynamically resize the head, whose size is precisely maxlevel. Since maxlevel is approximately log N, resizing the head happens rarely and when it does happen, it involves very little work. If you really wanted to avoid even that, you could initially create the head with a capacity sufficient for the entire available storage, although that would be a waste of space if most of your skiplists were a few hundred elements.
This all works because the search procedure for a skiplist never moves up a level; only down. So inserting a new node with a higher level than any existing node does not require modifying any existing node except the head, which must be modified to point to the new node at its highest level. (Otherwise, the new level will be useless.)
As a curious implementation detail, it's not necessary to store the size of a node; the fact that a node is pointed to at level i is sufficient to demonstrate that the node has at least i levels, so there is never any need to compare i with the size of the node. Only the size of the head needs to be known.

maxlevel can be dynamically increased.
use maxlevel=2^(n-1) for the first (2^n-1) nodes. when the 2^n node come, it is assigned with level=2^n, and the next 2^n...2^(n+1) nodes use maxlevel=2^n.

Related

Algorithm for selection the most frequent object during factorization

I have N objects, and M sets of those objects. Sets are non-empty, different, and may intersect. Typically M and N are of the same order of magnitude, usually M > N.
Historically my sets were encoded as-is, each just contained a table (array) of its objects, but I'd like to create a more optimized encoding. Typically some objects present in most of the sets, and I want to utilize this.
My idea is to represent sets as stacks (i.e. single-directional linked lists), whereas their bottom parts can be shared across different sets. It can also be defined as a tree, whereas each node/leaf has a pointer to its parent, but not children.
Such a data structure will allow to use the most common subsets of objects as roots, which all the appropriate sets may "inherit".
The most efficient encoding is computed by the following algorithm. I'll write it as a recursive pseudo-code.
BuildAllChains()
{
BuildSubChains(allSets, NULL);
}
BuildSubChains(sets, pParent)
{
if (sets is empty)
return;
trgObj = the most frequent object from sets;
pNode = new Node;
pNode->Object = trgObj;
pNode->pParent = pParent;
newSets = empty;
for (each set in sets that contains the trgObj)
{
remove trgObj from set;
remove set from sets;
if (set is empty)
set->pHead = pNode;
else
newSets.Insert(set);
}
BuildSubChains(sets, pParent);
BuildSubChains(newSets, pNode);
}
Note: the pseudo-code is written in a recursive manner, but technically naive recursion should not be used, because at each point the splitting is not balanced, and in a degenerate case (which is likely, since the source data isn't random) the recursion depth would be O(N).
Practically I use a combination of loop + recursion, whereas recursion always invoked on a smaller part.
So, the idea is to select each time the most common object, create a "subset" which inherits its parent subset, and all the sets that include it, as well as all the predecessors selected so far - should be based on this subset.
Now, I'm trying to figure-out an effective way to select the most frequent object from the sets. Initially my idea was to compute the histogram of all the objects, and sort it once. Then, during the recursion, whenever we remove an object and select only sets that contain/don't contain it - deduce the sorted histogram of the remaining sets. But then I realized that this is not trivial, because we remove many sets, each containing many objects.
Of course we can select each time the most frequent object directly, i.e. O(N*M). But it also looks inferior, in a degenerate case, where an object exists in either almost all or almost none sets we may need to repeat this O(N) times. OTOH for those specific cases in-place adjustment of the sorted histogram may be preferred way to go.
So far I couldn't come up with a good enough solution. Any ideas would be appreciated. Thanks in advance.
Update:
#Ivan: first thanks a lot for the answer and the detailed analysis.
I do store the list of elements within the histogram rather than the count only. Actually I use pretty sophisticated data structures (not related to STL) with intrusive containers, corss-linked pointers and etc. I planned this from the beginning, because than it seemed to me that the histogram adjustment after removing elements would be trivial.
I think the main point of your suggestion, which I didn't figure-out myself, is that at each step the histograms should only contain elements that are still present in the family, i.e. they must not contain zeroes. I thought that in cases where the splitting is very uneven creating a new histogram for the smaller part is too expensive. But restricting it to only existing elements is a really good idea.
So we remove sets of the smaller family, adjust the "big" histogram and build the "small" one. Now, I need some clarifications about how to keep the big histogram sorted.
One idea, which I thought about first, was immediate fix of the histogram after every single element removal. I.e. for every set we remove, for every object in the set, remove it from the histogram, and if the sort is broken - swap the histogram element with its neighbor until the sort is restored.
This seems good if we remove small number of objects, we don't need to traverse the whole histogram, we do a "micro-bubble" sort.
However when removing large number of objects it seems better to just remove all the objects and then re-sort the array via quick-sort.
So, do you have a better idea regarding this?
Update2:
I think about the following: The histogram should be a data structure which is a binary search tree (auto-balanced of course), whereas each element of the tree contains the appropriate object ID and the list of the sets it belongs to (so far). The comparison criteria is the size of this list.
Each set should contain the list of objects it contains now, whereas the "object" has the direct pointer to the element histogram. In addition each set should contain the number of objects matched so far, set to 0 at the beginning.
Technically we need a cross-linked list node, i.e. a structure that exists in 2 linked lists simultaneously: in the list of a histogram element, and in the list of the set. This node also should contain pointers to both the histogram item and the set. I call it a "cross-link".
Picking the most frequent object is just finding the maximum in the tree.
Adjusting such a histogram is O(M log(N)), whereas M is the number of elements that are currently affected, which is smaller than N if only a little number is affected.
And I'll also use your idea to build the smaller histogram and adjust the bigger.
Sounds right?
I denote the total size of sets with T. The solution I present works in time O(T log T log N).
For the clarity I denote with set the initial sets and with family the set of these sets.
Indeed, let's store a histogram. In BuildSubChains function we maintain a histogram of all elements which are presented in the sets at the moment, sorted by frequency. It may be something like std::set of pairs (frequency, value), maybe with cross-references so you could find an element by value. Now taking the most frequent element is straightforward: it is the first element in the histogram. However, maintaining it is trickier.
You split your family of sets into two subfamilies, one containing the most frequent element, one not. Let there total sizes be T' and T''. Take the family with the smallest total size and remove all elements from its sets from the histogram, making the new histogram on the run. Now you have a histogram for both families, and it is built in time O(min(T', T'') log n), where log n comes from operations with std::set.
At the first glance it seems that it works in quadratic time. However, it is faster. Take a look at any single element. Every time we explicitly remove this element from the histogram the size of its family at least halves, so each element will directly participate in no more than log T removals. So there will be O(T log T) operations with histograms in total.
There might be a better solution if I knew the total size of sets. However, no solution can be faster than O(T), and this is only logarithmically slower.
There may be one more improvement: if you store in the histogram not only elements and frequencies, but also the sets that contain the element (simply another std::set for each element) you'll be able to efficiently select all sets that contain the most frequent element.

A*, what's the best data structure for the open list?

Disclaimer: I really believe that this is not a duplicate of similar questions. I've read those, and they (mostly) recommend using a heap or a priority queue. My question is more of the "I don't understand how those would work in this case" kind.
In short:
I'm referring to the typical A* (A-star) pathfinding algorithm, as described (for example) on Wikipedia:
https://en.wikipedia.org/wiki/A*_search_algorithm
More specifically, I'm wondering about what's the best data structure (which can be a single well known data structure, or a combination of those) to use so that you never have O(n) performance on any of the operations that the algorithm requires to do on the open list.
As far as I understand (mostly from the Wikipedia article), the operations needed to be done on the open list are as follows:
The elements in this list need to be Node instances with the following properties:
position (or coordinates). For the sake of argument, let's say this is a positive integer ranging in value from 0 to 64516 (I'm limiting my A* area size to 254x254, which means that any set of coordinates can be bit-encoded on 16 bits)
F score. This is positive floating point value.
Given these, the operations are:
Add a node to the open list: if a node with the same position (coordinates) exists (but, potentially, with a different F score), replace it.
Retrieve (and remove) from the open list the node with the lowest F score
(Check if exists and) retrieve from the list a node for a given position (coordinates)
As far as I can see, the problem with using a Heap or Priority Queue for the open list are:
These data structure will use the F-score as sorting criteria
As such, adding a node to this kind of data structure is problematic: how do you check optimally that a node with a similar set of coordinates (but a different F Score) doesn't already exist. Furthermore, even if you somehow are able to do this check, if you actually find such a node, but it is not on the top of the Heap/Queue, how to you optimally remove it such that the Heap/Queue keeps its correct order
Also, checking for existence and removing a node based on its position is not optimal or even possible: if we use a Priority Queue, we have to check every node in it, and remove the corresponding one if found. For a heap, if such a removal is necessary, I imagine that all remaining elements need to be extracted and re-inserted, so that the heap still remains a heap.
The only remaining operating where such a data structure would be good is when we want to remove the node with the lowest F-score. In this case the operation would be O(Log(n)).
Also, if we make a custom data structure, such as one that uses a Hashtable (with position as key) and a Priority Queue, we would still have some operations that require suboptimal processing on either of these: In order to keep them in sync (both should have the same nodes in them), for a given operation, that operation will always be subomtimal on one of the data structures: adding or removing a node by position would be fast on the Hashtable but slow on the Priority Queue. Removing the node with the lowest F score would be fast on the Priority Queue but slow on the Hashtable.
What I've done is make a custom Hashtable for the nodes that uses their position as key, that also keeps track of the current node with the lowest F score. When adding a new node, it checks if its F score is lower than the currently stored lowest F score node, and if so, it replaces it. The problem with this data structure comes when you want to remove a node (whether by position or the lowest F scored one). In this case, in order to update the field holding the current lowest F score node, I need to iterate through all the remaining node in order to find which one has the lowest F score now.
So my question is: is there a better way to store these ?
You can combine the hash table and the heap without slow operations showing up.
Have the hash table map position to index in the heap instead of node.
Any update to the heap can sync itself (which requires the heap to know about the hash table, so this is invasive and not just a wrapper around two off-the-shelf implementations) to the hash table with as many updates (each O(1), obviously) as the number of items that move in the heap, of course only log n items can move for an insertion, remove-min or update-key. The hash table finds the node (in the heap) to update the key of for the parent-updating/G-changing step of A* so that's fast too.

Linked Lists and Sentinal node

So i have been asked in my homework to merge-sort 2 different sorted circular linked lists without useing a sentinal node allso the lists can be empty, my questsion is what is a sentinal node in the first place?
a sentinal node is a node that contains no real data - it's just there for the convenience of the implementation.
Thus a list with 4 real elements might have one or more extra nodes, making a total of 5 or 6 nodes.
Those extra nodes might be place holders (e.g. marking where you started the merge), pseudo-nodes indicating the head of the list, or anything else the algorithm designer can think up.
A sentinel node is a node that you add to your code to avoid handling degeneracies with special code. For merge sort for example, you can add a node with value = INFINITY to the end of both lists that you want to merge, this guarantees that once you hit the end of a list you can't go beyond that because the value is always greater (or equal) to the values in the other list.
So if you are not using a sentinel, you have to write code to handle this. In your merge routine, you should check that you've reached the end..
Sentinel node is a traversal path terminator in linked lists and tree. It doesn't hold or reference any data managed by the data structure. One of the benefit is to Reduce algorithmic complexity and code size.In your case the complexity,code size will be increased and speed of operation will be decreased.

return inserted items for a given interval

How would one design a memory efficient system which accepts Items added into it and allows Items to be retrieved given a time interval (i.e. return Items inserted between time T1 and time T2). There is no DB involved. Items stored in-memory. What is the data structure involved and associated algorithm.
Updated:
Assume extremely high insertion rate compared to data query.
You can use a sorted data structure, where key is by time of arrival. Note the following:
items are not remvoed
items are inserted in order [if item i was inserted after item j then key(i)>key(j)].
For this reason, tree is discouraged, since it is "overpower", and insertion in it is O(logn), where you can get an O(1) insertion. I suggest using one of the followings:
(1)Array: the array will be filled up always at its end. When the allocated array is full, reallocate a bigger [double sized] array, and copy existing array to it.
Advantages: good caching is usually expected in arrays, O(1) armotorized insertion, used space is at most 2*elementSize*#elemetns
Disadvantages: high latency: when the array is full, it will take O(n) to add an element, so you need to expect that once in a while, there will be costly operation.
(2)Skip list The skip list also allows you also O(logn) seek and O(1) insertion at the end, but it doesn't have latency issues. However, it will suffer more from cache misses then an array. Space used is on average elementSize*#elements + pointerSize*#elements*2 for a skip list.
Advantages: O(1) insertion, no costly ops.
Distadvantages: bad caching is expected.
Suggestion:
I suggest using an array if latency is not an issue. If it is, you should better use a skip list.
In both, finding the desired interval is:
findInterval(T1,T2):
start <- data.find(T1)
end <- data.find(T2)
for each element in data from T1 to T2:
yield element
Either BTree or Binary Search Tree could be a good in-memory data structure to accomplish the above. Just save the timestamp in each node and you can do a range query.
You can add them all to a simple array and sort them.
Do a binary search to located both T1 and T2. All the array elements between them are what you are looking for.
This is helpful if the searching is done only after all the elements are added. If not you can use an AVL or Red-Black tree
How about a relation interval tree (encode your items as intervals containing only a single element, e.g., [a,a])? Although, it has been said already that the ratio of the anticipated operations matter (a lot actually). But here's my two cents:
I suppose an item X that is inserted at time t(X) is associated with that timestamp, right? Meaning you don't insert an item now which has a timestamp from a week ago or something. If that's the case go for the simple array and do interpolation search or something similar (your items will already be sorted according to the attribute that your query refers to, i.e., the time t(X)).
We already have an answer that suggests trees, but I think we need to be more specific: the only situation in which this is really a good solution is if you are very specific about how you build up the tree (and then I would say it's on par with the skip lists suggested in a different answer; ). The objective is to keep the tree as full as possible to the left - I'll make clearer what that means in the following. Make sure each node has a pointer to its (up to) two children and to its parent and knows the depth of the subtree rooted at that node.
Keep a pointer to the root node so that you are able to do lookups in O(log(n)), and keep a pointer to the last inserted node N (which is necessarily the node with the highest key - its timestamp will be the highest). When you are inserting a node, check how many children N has:
If 0, then replace N with the new node you are inserting and make N its left child. (At this point you'll need to update the tree depth field of at most O(log(n)) nodes.)
If 1, then add the new node as its right child.
If 2, then things get interesting. Go up the tree from N until either you find a node that has only 1 child, or the root. If you find a node with only 1 child (this is necessarily the left child), then add the new node as its new right child. If all nodes up to the root have two children, then the current tree is full. Add the new node as the new root node and the old root node as its left child. Don't change the old tree structure otherwise.
Addendum: in order to make cache behaviour and memory overhead better, the best solution is probably to make a tree or skip list of arrays. Instead of every node having a single time stamp and a single value, make every node have an array of, say, 1024 time stamps and values. When an array fills up you add a new one in the top level data structure, but in most steps you just add a single element to the end of the "current array". This wouldn't affect big-O behaviour with respect to either memory or time, but it would reduce the overhead by a factor of 1024, while latency is still very small.

Efficient way to handle adding and removing items by bitwise And

So, suppose you have a collection of items. Each item has an identifier which can be represented using a bitfield. As a simple example, suppose your collection is:
0110, 0111, 1001, 1011, 1110, 1111
So, you then want to implement a function, Remove(bool bitval, int position). For example, a call to Remove(0, 2) would remove all items where index 2(i.e. 3rd bit) was 0. In this case, that would be 1001, only. Remove(1,1) would remove 1110, 1111, 0111, and 0110. It is trivial to come up with an O(n) collection where this is possible (just use a linked list), with n being the number of items in the collection. In general the number of items to be removed is going to be O(n) (assuming a given bit has a ≥ c% chance of being 1 and a ≥ c% chance of being 0, where c is some constant > 0), so "better" algorithms which somehow are O(l), with l being the number of items being removed, are unexciting.
Is it possible to define a data structure where the average (or better yet, worst case) removal time is better than O(n)? A binary tree can do pretty well (just remove all left/right branches at the height m, where m is the index being tested), but I'm wondering if there is any way to do better (and quite honestly, I'm not sure how to removing all left or right branches at a particular height in an efficient manner). Alternatively, is there a proof that doing better is not possible?
Edit: I'm not sure exactly what I'm expecting in terms of efficiency (sorry Arno), but a basic explanation of it's possible application is thus: Suppose we are working with a binary decision tree. Such a tree could be used for a game tree or a puzzle solver or whatever. Further suppose the tree is small enough that we can fit all of the leaf nodes into memory. Each such node is basically just a bitfield listing all of the decisions. Now, if we want to prune arbitrary decisions from this tree, one method would be to just jump to the height where a particular decision is made and prune the left or right side of every node (left meaning one decision, right meaning the other). Normally in a decision tree you only want to prune subtree at a time (since the parent of that subtree is different from the parent of other subtrees and thus the decision which should be pruned in one subtree should not be pruned from others), but in some types of situations this may not be the case. Further, you normally only want to prune everything below a particular node, but in this case you'll be leaving some stuff below the node but also pruning below other nodes in the tree.
Anyhow, this is somewhat of a question based on curiousity; I'm not sure it's practical to use any results, but am interested in what people have to say.
Edit:
Thinking about it further, I think the tree method is actually O(n / logn), assuming it's reasonably dense. Proof:
Suppose you have a binary tree with n items. It's height is log(n). Removing half the bottom will require n/2 removals. Removing the half the row above will require n/4. The sum of operations for each row is n-1. So the average number of removals is n-1 / log(n).
Provided the length of your bitfields is limited, the following may work:
First, represent the bitfields that are in the set as an array of booleans, so in your case (4 bit bitfields), new bool[16];
Transform this array of booleans into a bitfield itself, so a 16-bit bitfield in this case, where each bit represents whether the bitfield corresponding to its index is included
Then operations become:
Remove(0, 0) = and with bitmask 1010101010101010
Remove(1, 0) = and with bitmask 0101010101010101
Remove(0, 2) = and with bitmask 1111000011110000
Note that more complicated 'add/remove' operations could then also be added as O(1) bit-logic.
The only down-side is that extra work is needed to interpret the resulting 16-bit bitfield back into a set of values, but with lookup arrays that might not turn out too bad either.
Addendum:
Additional down-sides:
Once the size of an integer is exceeded, every added bit to the original bit-fields will double the storage space. However, this is not much worse than a typical scenario using another collection where you have to store on average half the possible bitmask values (provided the typical scenario doesn't store far less remaining values).
Once the size of an integer is exceeded, every added bit also doubles the number of 'and' operations needed to implement the logic.
So basically, I'd say if your original bitfields are not much larger than a byte, you are likely better off with this encoding, beyond that you're probably better off with the original strategy.
Further addendum:
If you only ever execute Remove operations, which over time thins out the set state-space further and further, you may be able to stretch this approach a bit further (no pun intended) by making a more clever abstraction that somehow only keeps track of the int values that are non-zero. Detecting zero values may not be as expensive as it sounds either if the JIT knows what it's doing, because a CPU 'and' operation typically sets the 'zero' flag if the result is zero.
As with all performance optimizations, this one'd need some measurement to determine if it is worthwile.
If each decision bit and position are listed as objects, {bit value, k-th position}, you would end up with an array of length 2*k. If you link to each of these array positions from your item, represented as a linked list (which are of length k), using a pointer to the {bit, position} object as the node value, you can "invalidate" a bunch of items by simply deleting the {bit, position} object. This would require you, upon searching the list of items, to find "complete" items (it makes search REALLY slow?).
So something like:
[{0,0}, {1,0}, {0,1}, {1, 1}, {0,2}, {1, 2}, {0,3}, {1,3}]
and linked from "0100", represented as: {0->3->4->6}
You wouldn't know which items were invalid until you tried to find them (so it doesn't really limit your search space, which is what you're after).
Oh well, I tried.
Sure, it is possible (even if this is "cheating"). Just keep a stack of Remove objects:
struct Remove {
bool set;
int index;
}
The remove function just pushes an object on the stack. Viola, O(1).
If you wanted to get fancy, your stack couldn't exceed (number of bits) without containing duplicate or impossible scenarios.
The rest of the collection has to apply the logic whenever things are withdrawn or iterated over.
Two ways to do insert into the collection:
Apply the Remove rules upon insert, to clear out the stack, making in O(n). Gotta pay somewhere.
Each bitfield has to store it's index in the remove stack, to know what rules apply to it. Then, the stack size limit above wouldn't matter
If you use an array to store your binary tree, you can quickly index any element (the children of the node at index n are at index (n+1)*2 and (n+1)*2-1. All the nodes at a given level are stored sequentially. The first node at at level x is 2^x-1 and there are 2^x elements at that level.
Unfortunately, I don't think this really gets you much of anywhere from a complexity standpoint. Removing all the left nodes at a level is O(n/2) worst case, which is of course O(n). Of course the actual work depends on which bit you are checking, so the average may be somewhat better. This also requires O(2^n) memory which is much worse than the linked list and not practical at all.
I think what this problem is really asking is for a way to efficiently partition a set of sets into two sets. Using a bitset to describe the set gives you a fast check for membership, but doesn't seem to lend itself to making the problem any easier.

Resources