List Data Structure for Fast Index Finding - data-structures

I need to store a large number of elements in a list-like data structure. The additional requirement is that it should be fast to determine the index for each element at any time. The elements are not sorted, and sorting is not possible.
If a simple array is used, we will have to use a linear search each time the index for an element is queried. This works, but it is a very inefficient solution. Here is the data structure in pseudo code:
class IndexList1 {
    Array elements

    getIndex(e) {
        // linear search: O(n) per query
        for (i = 0; i < elements.length; i++) {
            if (elements[i] == e) {
                return i
            }
        }
        return -1
    }

    insertAt(e, i) {
        elements.insertAt(e, i)
    }

    removeFrom(i) {
        elements.removeFrom(i)
    }
}
A better approach is to map the element to the index either in an internal map or as a field in the element. This way no search is necessary to determine the index of an element. The problem here is that after inserting or removing an element into/from the middle of the list, all indexes after the insertion point have to be updated. For frequent insertions and deletions and a large number of elements this is not very efficient.
In pseudo code the improved data structure looks like this:
class IndexList2 {
    Array elements
    Map elementIndexMap

    getIndex(e) {
        return elementIndexMap.get(e)
    }

    insertAt(e, i) {
        elements.insertAt(e, i)
        updateIndicesFrom(i)
    }

    removeFrom(i) {
        elementIndexMap.delete(elements[i])
        elements.removeFrom(i)
        updateIndicesFrom(i)
    }

    updateIndicesFrom(i) {
        // O(n - i): every index at or after the change must be rewritten
        for (; i < elements.length; i++) {
            elementIndexMap.set(elements[i], i)
        }
    }
}
Is there a clever list-like data structure for a large number of elements that keeps track of the index for each element, even with many insertions and deletions, in an efficient way? By efficient I mean O(1) for getIndex() and something better than O(n) for insert() and remove().

The usual way to solve this problem is to put the items in some kind of balanced tree (red-black tree, B+tree, skip-list, etc.) with parent pointers, and to label each node in the tree with the size of its subtree.
Then you can find the index of any item by walking up to the root, and adding up the sizes of any subtrees to its left.
It's like an order statistic tree (https://en.wikipedia.org/wiki/Order_statistic_tree) without sorting the keys.
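In code, the index computation is just a walk from the node to the root. A minimal sketch in Java, assuming each node stores a parent pointer and the size of its subtree (the class and field names are illustrative, and the rebalancing that keeps the size fields correct on insert/remove is omitted):
// Sketch: finding an element's list index from subtree sizes.
class OrderStatisticNode<T> {
    T value;
    OrderStatisticNode<T> left, right, parent;
    int size = 1; // number of nodes in the subtree rooted here

    static <T> int indexOf(OrderStatisticNode<T> node) {
        // everything in the node's own left subtree comes before it
        int index = (node.left != null) ? node.left.size : 0;
        // walking up: each time we come up from a right child, the parent
        // and the parent's left subtree also come before us
        for (OrderStatisticNode<T> n = node; n.parent != null; n = n.parent) {
            if (n == n.parent.right) {
                index += 1 + (n.parent.left != null ? n.parent.left.size : 0);
            }
        }
        return index; // 0-based list index, O(log n) in a balanced tree
    }
}
This gives O(log n) getIndex, insert and remove rather than the O(1) getIndex asked for, but it avoids the O(n) reindexing after every structural change.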

Related

Generalizing the find-min/find-max stack to arbitrary order statistics?

In this earlier question, the OP asked for a data structure similar to a stack supporting the following operations in O(1) time each:
Push, which adds a new element atop the stack,
Pop, which removes the top element from the stack,
Find-Max, which returns (but does not remove) the largest element of the stack, and
Find-Min, which returns (but does not remove) the smallest element of the stack.
A few minutes ago I found this related question asking for a clarification on a similar data structure that, instead of allowing for the max and min to be queried, allows for the median element of the stack to be queried. These two data structures seem to be special cases of a more general data structure supporting the following operations:
Push, which pushes an element atop the stack,
Pop, which pops the top of the stack, and
Find-Kth, which for a fixed k determined when the structure is created, returns the kth largest element of the stack.
It is possible to support all of these operations by storing a stack and a balanced binary search tree holding the top k elements, which would enable all these operations to run in O(log k) time. My question is this: is it possible to implement the above data structure faster than this? That is, could we get O(1) for all three operations? Or perhaps O(1) for push and pop and O(log k) for the order statistic lookup?
Since the structure can be used to sort k elements with O(k) push and find-kth operations, every comparison-based implementation has at least one of these operations costing Omega(log k), even in an amortized sense, with randomization.
Push can be O(log k) and pop/find-kth can be O(1) (use persistent data structures; push should precompute the order statistic). My gut feeling based on working with lower bounds for comparison-based algorithms is that O(1) push/pop and O(log k) find-kth is doable but requires amortization.
I think what tophat was saying is: implement a purely functional data structure that supports only O(log k) insert and O(1) find-kth (cached by insert), and then make a stack of these structures. Push inserts into the top version and pushes the update, pop pops the top version, and find-kth operates on the top version. This is O(log k)/O(1)/O(1) but super-linear space.
EDIT: I was working on O(1) push/O(1) pop/O(log k) find-kth, and I think it can't be done. The sorting algorithm that tophat referred to can be adapted to get √k evenly spaced order statistics of a length-k array in time O(k + (√k) log k). Problem is, the algorithm must know how each order statistic compares with all other elements (otherwise it might be wrong), which means that it has bucketed everything into one of √k + 1 buckets, which takes Ω(k log (√k + 1)) = Ω(k log k) comparisons on information theoretic grounds. Oops.
Replacing √k by k^eps for any eps > 0: with O(1) push/O(1) pop, I don't think find-kth can be O(k^(1-eps)), even with randomization and amortization.
Whether this is actually faster than your O(log k) implementation depends on which operations are used most frequently. I propose an implementation with O(1) Find-kth and Pop and O(n) Push, where n is the stack size. And I also want to share this with SO because it is just a hilarious data structure at first sight, but might even be reasonable.
It's best described as a doubly doubly linked stack, or perhaps more easily as a hybrid of a linked stack and a doubly linked sorted list. Basically each node maintains 4 references to other nodes: the next and previous in stack order, and the next and previous in sorted order by element size. These two linked lists can be implemented using the same nodes, but they work completely separately, i.e. the sorted linked list doesn't have to know about the stack order and vice versa.
Like a normal linked stack, the collection itself will need to maintain a reference to the top node (and to the bottom?). To accommodate the O(1) nature of the Find-kth method, the collection will also keep a reference to the kth largest element.
The pop method works as follows:
The popped node gets removed from the sorted doubly linked list, just like a removal from a normal sorted linked list. It takes O(1) as the collection has a reference to the top. Depending on whether the popped element was larger or smaller than the kth element, the reference to the kth largest element is set to either the previous or the next. So the method still has O(1) complexity.
The push method works just like a normal addition to a sorted linked list, which is an O(n) operation. It starts with the smallest element and inserts the new node when a larger element is encountered. To maintain the correct reference to the kth largest element, again either the previous or the next element to the current kth largest element is selected, depending on whether the pushed node was larger or smaller than the kth largest element.
Of course next to this, the reference to the 'top' of the stack has to be set in both methods. Also there's the problem of k > n, for which you haven't specified what the data structure should do. I hope it is clear how it works, otherwise I could add an example.
But ok, not entirely the complexity you had hoped for, but I find this an interesting 'solution'.
Edit: An implementation of the described structure
A bounty was issued on this question, which indicates my original answer wasn’t good enough:P Perhaps the OP would like to see an implementation?
I have implemented both the median problem and the fixed-k problem, in C#. The implementation of the tracker of the median is just a wrapper around the tracker of the kth element, where k can mutate.
To recap the complexities:
Push takes O(n)
Pop takes O(1)
FindKth takes O(1)
Change k takes O(delta k)
I have already described the algorithm in reasonable detail in my original post. The implementation is then fairly straightforward (but not so trivial to get right, as there are a lot of inequality signs and if statements to consider). I have commented only to indicate what is done, not the details of how, as it would otherwise become too large. The code is already quite lengthy for a SO post.
I do want to provide the contracts of all non-trivial public members:
K is the index of the element in the sorted linked list to keep a reference to. It is mutable, and when it is set, the structure is immediately corrected for it.
KthValue is the value at that index, unless the structure doesn’t have k elements yet, in which case it returns a default value.
HasKthValue exists to easily distinguish these default values from elements which happened to be the default value of its type.
Constructors: a null enumerable is interpreted as an empty enumerable, and a null comparer is interpreted as the default. This comparer defines the order used when determining the kth value.
So this is the code:
public sealed class KthTrackingStack<T>
{
private readonly Stack<Node> stack;
private readonly IComparer<T> comparer;
private int k;
private Node smallestNode;
private Node kthNode;
public int K
{
get { return this.k; }
set
{
if (value < 0) throw new ArgumentOutOfRangeException();
for (; k < value; k++)
{
if (kthNode.NextInOrder == null)
return;
kthNode = kthNode.NextInOrder;
}
for (; k > value; k--)
{
if (kthNode.PreviousInOrder == null)
return;
kthNode = kthNode.PreviousInOrder;
}
}
}
public T KthValue
{
get { return HasKthValue ? kthNode.Value : default(T); }
}
public bool HasKthValue
{
get { return k < Count; }
}
public int Count
{
get { return this.stack.Count; }
}
public KthTrackingStack(int k, IEnumerable<T> initialElements = null, IComparer<T> comparer = null)
{
if (k < 0) throw new ArgumentOutOfRangeException("k");
this.k = k;
this.comparer = comparer ?? Comparer<T>.Default;
this.stack = new Stack<Node>();
if (initialElements != null)
foreach (T initialElement in initialElements)
this.Push(initialElement);
}
public void Push(T value)
{
//just like in a normal sorted linked list, the node before the inserted node has to be found
Node nodeBeforeNewNode;
if (smallestNode == null || comparer.Compare(value, smallestNode.Value) < 0)
nodeBeforeNewNode = null;
else
{
nodeBeforeNewNode = smallestNode;//untested optimization: nodeBeforeNewNode = comparer.Compare(value, kthNode.Value) < 0 ? smallestNode : kthNode;
while (nodeBeforeNewNode.NextInOrder != null && comparer.Compare(value, nodeBeforeNewNode.NextInOrder.Value) > 0)
nodeBeforeNewNode = nodeBeforeNewNode.NextInOrder;
}
//the following code includes the new node in the ordered linked list
Node newNode = new Node
{
Value = value,
PreviousInOrder = nodeBeforeNewNode,
NextInOrder = nodeBeforeNewNode == null ? smallestNode : nodeBeforeNewNode.NextInOrder
};
if (newNode.NextInOrder != null)
newNode.NextInOrder.PreviousInOrder = newNode;
if (newNode.PreviousInOrder != null)
newNode.PreviousInOrder.NextInOrder = newNode;
else
smallestNode = newNode;
//the following code deals with changes to the kth node due to adding the new node
if (kthNode != null && comparer.Compare(value, kthNode.Value) < 0)
{
if (HasKthValue)
kthNode = kthNode.PreviousInOrder;
}
else if (!HasKthValue)
{
kthNode = newNode;
}
stack.Push(newNode);
}
public T Pop()
{
Node result = stack.Pop();
//the following code deals with changes to the kth node
if (HasKthValue)
{
if (comparer.Compare(result.Value, kthNode.Value) <= 0)
kthNode = kthNode.NextInOrder;
}
else if(kthNode.PreviousInOrder != null || Count == 0)
{
kthNode = kthNode.PreviousInOrder;
}
//the following code maintains the order in the linked list
if (result.NextInOrder != null)
result.NextInOrder.PreviousInOrder = result.PreviousInOrder;
if (result.PreviousInOrder != null)
result.PreviousInOrder.NextInOrder = result.NextInOrder;
else
smallestNode = result.NextInOrder;
return result.Value;
}
public T Peek()
{
return this.stack.Peek().Value;
}
private sealed class Node
{
public T Value { get; set; }
public Node NextInOrder { get; internal set; }
public Node PreviousInOrder { get; internal set; }
}
}
public class MedianTrackingStack<T>
{
private readonly KthTrackingStack<T> stack;
public void Push(T value)
{
stack.Push(value);
stack.K = stack.Count / 2;
}
public T Pop()
{
T result = stack.Pop();
stack.K = stack.Count / 2;
return result;
}
public T Median
{
get { return stack.KthValue; }
}
public MedianTrackingStack(IEnumerable<T> initialElements = null, IComparer<T> comparer = null)
{
stack = new KthTrackingStack<T>(initialElements == null ? 0 : initialElements.Count()/2, initialElements, comparer);
}
}
Of course, you're always free to ask any question about this code, as I realize some things may not be obvious from the description and the sporadic comments.
The only actual working implementation I can wrap my head around is Push/Pop O(log k) and Kth O(1).
Stack (single linked)
Min Heap (size k)
Stack2 (doubly linked)
The value nodes will be shared between the Stack, Heap and Stack2
PUSH:
    Push to the stack
    If value >= heap root
        If heap size < k
            Insert value in heap
        Else
            Remove heap root
            Push removed heap root to stack2
            Insert value in heap
POP:
    Pop from the stack
    If popped node has stack2 references
        Remove from stack2 (doubly linked list remove)
    If popped node has heap references
        Remove from the heap (swap with last element, perform heap-up-down)
        Pop from stack2
        If element popped from stack2 is not null
            Insert element popped from stack2 into heap
KTH:
    If heap is size k
        Return heap root value
You could use a skip list. (I first thought of a linked list, but insertion is O(n); amit corrected me with the skip list. I think this data structure could be pretty interesting in your case.)
With this data structure, inserting/deleting would take O(ln(k))
and finding the maximum O(1)
I would use:
a stack, containing your elements
a stack containing the history of the skip list (containing the k smallest elements)
(I realised it was the kth largest element, but it's pretty much the same problem)
when pushing (O(ln(k))):
if the element is less than the kth element, delete the kth element (O(ln(k))), put it in the LIFO pile (O(1)), then insert the element into the skip list (O(ln(k)))
otherwise it's not in the skip list; just put it on the pile (O(1))
When pushing you add a new skip list to the history; since this is similar to copy-on-write, it wouldn't take more than O(ln(k))
when popping (O(1)):
you just pop from both stacks
getting the kth element (O(1)):
always take the maximum element in the list (O(1))
All the ln(k) are amortised cost.
Example:
I will take the same example as yours (on Stack with find-min/find-max more efficient than O(n)) :
Suppose that we have a stack and add the values 2, 7, 1, 8, 3, and 9, in that order, with k = 3.
I will represent it this way :
[number in the stack] [ skip list linked with that number]
First I push 2, 7 and 1 (it doesn't make sense to look for the kth element in a list of fewer than k elements):
1 [7,2,1]
7 [7,2,null]
2 [2,null,null]
If I want the kth element I just need to take the max in the linked list: 7
Now I push 8, 3 and 9.
on the top of the stack I have :
8 [7,2,1] since 8 > kth element, the skip list doesn't change
then :
3 [3,2,1] since 3 < kth element, the kth element has changed. I first delete 7, which was the previous kth element (O(ln(k))), then insert 3 (O(ln(k))) => total O(ln(k))
then :
9 [3,2,1] since 9 > kth element
Here is the stack I get :
9 [3,2,1]
3 [3,2,1]
8 [7,2,1]
1 [7,2,1]
7 [7,2,null]
2 [2,null,null]
find the kth element:
I get 3 in O(1)
now I can pop 9 and 3 (takes O(1)):
8 [7,2,1]
1 [7,2,1]
7 [7,2,null]
2 [2,null,null]
find the kth element:
I get 7 in O(1)
and push 0 (takes O(ln(k)) - insertion):
0 [2,1,0]
8 [7,2,1]
1 [7,2,1]
7 [7,2,null]
2 [2,null,null]
#tophat is right - since this structure could be used to implement a sort, it can't have less complexity than an equivalent sort algorithm. So how do you do a sort in less than O(lg N)? Use Radix Sort.
Here is an implementation which makes use of a Binary Trie. Inserting items into a binary trie is essentially the same operation as performing a radix sort. The cost for inserting and deleting is O(m), where m is a constant: the number of bits in the key. Finding the next largest or smallest key is also O(m), accomplished by taking the next step in an in-order depth-first traversal.
So the general idea is to use the values pushed onto the stack as keys in the trie. The data to store is the occurrence count of that item in the stack. For each pushed item: if it exists in the trie, increment its count, else store it with a count of 1. When you pop an item, find it, decrement the count, and remove it if the count is now 0. Both those operations are O(m).
To get O(1) FindKth, keep track of 2 values: The value of the Kth item, and how many instances of that value are in the first K item. (for example, for K=4 and a stack of [1,2,3,2,0,2], the Kth value is 2 and the "iCount" is 2.) Then when you push values < the KthValue, you simply decrement the instance count, and if it is 0, do a FindPrev on the trie to get the next smaller value.
When you pop values greater than the KthValue, increment the instance count if more instances of that value exist, else do a FindNext to get the next larger value.
(The rules are different if there are fewer than K items. In that case, you can simply track the max inserted value. When there are K items, the max will be the Kth.)
Here is a C implementation. It relies on a BinaryTrie (built using the example at PineWiki as a base) with this interface:
BTrie* BTrieInsert(BTrie* t, Item key, int data);
BTrie* BTrieFind(BTrie* t, Item key);
BTrie* BTrieDelete(BTrie* t, Item key);
BTrie* BTrieNextKey(BTrie* t, Item key);
BTrie* BTriePrevKey(BTrie* t, Item key);
Here is the Push function.
void KSStackPush(KStack* ks, Item val)
{
BTrie* node;
//resize if needed
if (ks->ct == ks->sz) ks->stack = realloc(ks->stack,sizeof(Item)*(ks->sz*=2));
//push val
ks->stack[ks->ct++]=val;
//record count of value instances in trie
node = BTrieFind(ks->trie, val);
if (node) node->data++;
else ks->trie = BTrieInsert(ks->trie, val, 1);
//adjust kth if needed
ksCheckDecreaseKth(ks,val);
}
Here is the helper to track the KthValue
//check if inserted val is in set of K
void ksCheckDecreaseKth(KStack* ks, Item val)
{
//if less than K items, track the max.
if (ks->ct <= ks->K) {
if (ks->ct==1) { ks->kthValue = val; ks->iCount = 1;} //1st item
else if (val == ks->kthValue) { ks->iCount++; }
else if (val > ks->kthValue) { ks->kthValue = val; ks->iCount = 1;}
}
//else if value is one of the K, decrement instance count
else if (val < ks->kthValue && (--ks->iCount<=0)) {
//if that was only instance in set,
//find the previous value, include all its instances
BTrie* node = BTriePrevKey(ks->trie, ks->kthValue);
ks->kthValue = node->key;
ks->iCount = node->data;
}
}
Here is the Pop function
Item KSStackPop(KStack* ks)
{
//pop val
Item val = ks->stack[--ks->ct];
//find in trie
BTrie* node = BTrieFind(ks->trie, val);
//decrement count, remove if no more instances
if (--node->data == 0)
ks->trie = BTrieDelete(ks->trie, val);
//adjust kth if needed
ksCheckIncreaseKth(ks,val);
return val;
}
And the helper to increase the KthValue
//check if removing val causes Kth to increase
void ksCheckIncreaseKth(KStack* ks, Item val)
{
//if less than K items, track max
if (ks->ct < ks->K)
{ //if removing the max,
if (val==ks->kthValue) {
//find the previous node, and set the instance count.
BTrie* node = BTriePrevKey(ks->trie, ks->kthValue);
ks->kthValue = node->key;
ks->iCount = node->data;
}
}
//if removed val was among the set of K,add a new item
else if (val <= ks->kthValue)
{
BTrie* node = BTrieFind(ks->trie, ks->kthValue);
//if more instances of kthValue exist, add 1 to set.
if (node && ks->iCount < node->data) ks->iCount++;
//else include 1 instance of next value
else {
BTrie* node = BTrieNextKey(ks->trie, ks->kthValue);
ks->kthValue = node->key;
ks->iCount = 1;
}
}
}
So this algorithm is O(1) for all 3 operations. It can also support the Median operation: start with KthValue = the first value, and whenever the stack size changes by 2, do an IncreaseKth or DecreaseKth operation. The downside is that the constant is large. It is only a win when m < lg K. However, for small keys and large K, this may be a good choice.
What if you paired the stack with a pair of Fibonacci Heaps? That could give amortized O(1) Push and FindKth, and O(lgN) delete.
The stack stores [value, heapPointer] pairs. The heaps store stack pointers.
Create one MaxHeap, one MinHeap.
On Push:
if MaxHeap has less than K items, insert the stack top into the MaxHeap;
else if the new value is less than the top of the MaxHeap, first insert the result of DeleteMax in the MinHeap, then insert the new item into MaxHeap;
else insert it into the MinHeap. O(1) (or O(lgK) if DeleteMax is needed)
On FindKth, return the top of the MaxHeap. O(1)
On Pop, also do a Delete(node) from the popped item's heap.
If it was in the MinHeap, you are done. O(lgN)
If it was in the MaxHeap, also perform a DeleteMin from the MinHeap and Insert the result in the MaxHeap. O(lgK)+O(lgN)+O(1)
Update:
I realized I wrote it up as K'th smallest, not K'th largest.
I also forgot a step when a new value is less than the current K'th smallest, and that step pushes the worst-case insert back to O(lg K). This may still be ok for uniformly distributed input and small K, as it will only hit that case on K/N insertions.
*moved New Idea to different answer - it got too large.
Use a Trie to store your values. Tries already have an O(1) insert complexity. You only need to worry about two things, popping and searching, but if you tweak your program a little, it would be easy.
When inserting (pushing), have a counter for each path that stores the number of elements inserted there. This will allow each node to keep track of how many elements have been inserted using that path, i.e. the number represents the number of elements that are stored beneath that path. That way, when you try to look for the kth element, it would be a simple comparison at each path.
For popping, you can have a static object that has a link to the last stored object. That object can be accessed from the root object, hence O(1). Of course, you would need to add functions to retrieve the last object inserted, which means the newly pushed node must have a pointer to the previously pushed element (implemented in the push procedure; very simple, also O(1)). You also need to decrement the counter, which means each node must have a pointer to the parent node (also simple).
For finding the kth element (this is for the kth smallest element, but finding the largest is very similar): when you enter each node you pass in k and the minimum index for the branch (for the root it would be 0). Then you do a simple if comparison for each path: if (k between minimum index and minimum index + pathCounter), you enter that path passing in k and the new minimum index as (minimum index + sum of all previous pathCounters, excluding the one you took). I think this is O(1), since increasing the amount of data within a certain range doesn't increase the difficulty of finding k.
I hope this helps, and if anything is not very clear, just let me know.

trie or balanced binary search tree to store dictionary?

I have a simple requirement (perhaps hypothetical):
I want to store an English word dictionary (n words) and, given a word (character length m), the dictionary should be able to tell if the word exists in the dictionary or not.
What would be an appropriate data structure for this?
a balanced binary search tree? as done in C++ STL associative data structures like set,map
or
a trie on strings
Some complexity analysis:
In a balanced BST, time would be O(m log n) (comparing 2 strings takes O(m) time, character by character).
In a trie, if we could branch out in O(1) time at each node, we could find the word in O(m), but the assumption that we can branch in O(1) time at each node is not valid. At each node, the maximum possible number of branches is 26. If we want O(1) at a node, we have to keep a short array indexable by character at each node, which blows up the space. After a few levels in the trie, branching reduces, so it's better to keep a linked list of next-node characters and pointers.
What looks more practical? Any other trade-offs?
Thanks,
I'd say use a Trie, or better yet use its more space efficient cousin the Directed Acyclic Word Graph (DAWG).
It has the same runtime characteristics (insert, look up, delete) as a Trie but overlaps common suffixes as well as common prefixes which can be a big saving on space.
If this is C++, you should also consider std::tr1::unordered_set. (If you have C++0x, you can use std::unordered_set.)
This just uses a hash table internally, which I would wager will out-perform any tree-like structure in practice. It is also trivial to implement because you have nothing to implement.
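As a rough illustration of the same hash-based idea in Java (HashSet playing the role of unordered_set; the words.txt file and its one-word-per-line format are assumptions for the sketch):
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

// Sketch: dictionary membership test backed by a hash set.
class HashDictionary {
    public static void main(String[] args) throws IOException {
        // load one word per line into a hash set
        Set<String> dict = new HashSet<>(Files.readAllLines(Paths.get("words.txt")));
        System.out.println(dict.contains("hello")); // expected O(1) average lookup
    }
}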
Binary search is going to be easier to implement and it's only going to involve comparing tens of strings at the most. Given you know the data up front, you can build a balanced binary tree so performance is going to be predictable and easily understood.
With that in mind, I'd use a standard binary tree (probably using set from C++ since that's typically implemented as a tree).
A simple solution is to store the dict as sorted, \n-separated words on disk, load it into memory and do a binary search. The only non-standard part here is that you have to scan backwards for the start of a word when you're doing the binary search.
Here's some code! (It assumes globals wordlist pointing to the loaded dict, and wordlist_end which points to just after the end of the loaded dict.)
// Return >0 if word > word at position p.
// Return <0 if word < word at position p.
// Return 0 if word == word at position p.
static int cmp_word_at_index(size_t p, const char *word) {
while (p > 0 && wordlist[p - 1] != '\n') {
p--;
}
while (1) {
if (wordlist[p] == '\n') {
if (*word == '\0') return 0;
else return 1;
}
if (*word == '\0') {
return -1;
}
int char0 = toupper(*word);
int char1 = toupper(wordlist[p]);
if (char0 != char1) {
return (int)char0 - (int)char1;
}
++p;
++word;
}
}
// Test if a word is in the dictionary.
int is_word(const char* word_to_find) {
size_t index_min = 0;
size_t index_max = wordlist_end - wordlist;
while (index_min < index_max - 1) {
size_t index = (index_min + index_max) / 2;
int c = cmp_word_at_index(index, word_to_find);
if (c == 0) return 1; // Found word.
if (c < 0) {
index_max = index;
} else {
index_min = index;
}
}
return 0;
}
A huge benefit of this approach is that the dict is stored in a human-readable way on disk, and that you don't need any fancy code to load it (allocate a block of memory and read() it in, in one go).
If you want to use a trie, you could use a packed and suffix-compressed representation. Here's a link to one of Donald Knuth's students, Franklin Liang, who wrote about this trick in his thesis.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.123.7018&rep=rep1&type=pdf
It uses half the storage of the straightforward textual dict representation, gives you the speed of a trie, and you can (like the textual dict representation) store the whole thing on disk and load it in one go.
The trick it uses is to pack all the trie nodes into a single array, interleaving them where possible. As well as a new pointer (and an end-of-word marker bit) in each array location like in a regular trie, you store the letter that this node is for -- this lets you tell if the node is valid for your state or if it's from an overlapping node. Read the linked doc for a fuller and clearer explanation, as well as an algorithm for packing the trie into this array.
It's not trivial to implement the suffix-compression and greedy packing algorithm described, but it's easy enough.
The industry standard is to store the dictionary in a hashtable and get amortized O(1) lookup time. Space is no longer as critical in industry, especially due to advances in distributed computing.
A hashtable is how Google implements its autocomplete feature: specifically, use every prefix of a word as a key and put the word as the value in the hashtable.
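A minimal sketch of that prefix-as-key idea (purely illustrative; storing a list of whole words per prefix like this trades a lot of memory for the O(1) lookup):
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: every prefix of every word becomes a hash-table key.
class PrefixIndex {
    private final Map<String, List<String>> byPrefix = new HashMap<>();

    void add(String word) {
        for (int i = 1; i <= word.length(); i++) {
            byPrefix.computeIfAbsent(word.substring(0, i), p -> new ArrayList<>()).add(word);
        }
    }

    List<String> complete(String prefix) {
        // one hash lookup per query
        return byPrefix.getOrDefault(prefix, Collections.emptyList());
    }
}
A word of length m is stored under m keys, so the space blow-up is the obvious trade-off against a trie.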

How to find the rank of a node in an AVL tree?

I need to implement two rank queries [rank(k) and select(r)]. But before I can start on this, I need to figure out how the two functions work.
As far as I know, rank(k) returns the rank of a given key k, and select(r) returns the key of a given rank r.
So my questions are:
1.) How do you calculate the rank of a node in an AVL(self balancing BST)?
2.) Is it possible for more than one key to have the same rank? And if so, what would select(r) return?
I'm going to include a sample AVL tree which you can refer to if it helps answer the question.
Thanks!
Your question really boils down to: "how is the term 'rank' normally defined with respect to an AVL tree?" (and, possibly, how is 'select' normally defined as well).
At least as I've seen the term used, "rank" means the position among the nodes in the tree -- i.e., how many nodes are to its left. You're typically given a pointer to a node (or perhaps a key value) and you need to count the number of nodes to its left.
"Select" is basically the opposite -- you're given a particular rank, and need to retrieve a pointer to the specified node (or the key for that node).
Two notes: First, since neither of these modifies the tree at all, it makes no real difference what form of balancing is used (e.g., AVL vs. red/black); for that matter a tree with no balancing at all is equivalent as well. Second, if you need to do this frequently, you can improve speed considerably by adding an extra field to each node recording how many nodes are to its left.
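A minimal sketch of rank and select with such a size field, in Java with int keys (the AVL rotations that keep size up to date are omitted; size here is the node count of the whole subtree, which is equivalent to storing the left count):
// Sketch: rank(key) and select(r) in a BST whose nodes carry subtree sizes.
class RankedNode {
    int key;
    RankedNode left, right;
    int size = 1; // nodes in the subtree rooted here

    static int size(RankedNode n) { return n == null ? 0 : n.size; }

    // number of keys strictly smaller than key, O(height)
    static int rank(RankedNode root, int key) {
        int rank = 0;
        for (RankedNode n = root; n != null; ) {
            if (key < n.key) {
                n = n.left;
            } else if (key > n.key) {
                rank += size(n.left) + 1; // left subtree and this node are smaller
                n = n.right;
            } else {
                return rank + size(n.left);
            }
        }
        return rank; // key absent: the rank it would have if inserted
    }

    // the node with exactly r keys to its left, O(height)
    static RankedNode select(RankedNode root, int r) {
        for (RankedNode n = root; n != null; ) {
            int leftSize = size(n.left);
            if (r < leftSize) {
                n = n.left;
            } else if (r > leftSize) {
                r -= leftSize + 1;
                n = n.right;
            } else {
                return n;
            }
        }
        return null; // r out of range
    }
}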
Rank is the number of nodes in the left subtree plus one, and is calculated for every node. I believe rank is not a concept specific to AVL trees - it can be calculated for any binary tree.
Select is just opposite to rank. A rank is given and you have to return a node matching that rank.
The following code will perform rank calculation:
void InitRank(struct TreeNode *Node)
{
    if (!Node)
    {
        return;
    }
    else
    {
        Node->rank = 1 + NumberOfNodesInTree(Node->LChild);
        InitRank(Node->LChild);
        InitRank(Node->RChild);
    }
}

int NumberOfNodesInTree(struct TreeNode *Node)
{
    if (!Node)
    {
        return 0;
    }
    else
    {
        return (1 + NumberOfNodesInTree(Node->LChild) + NumberOfNodesInTree(Node->RChild));
    }
}
Here is the code I wrote, which worked fine for an AVL tree, to get the rank of a particular value. The difference is just that you used a node as the parameter and I used a key as the parameter; you can modify this your own way. Sample code:
public int rank(int data){
return rank(data,root);
}
private int rank(int data, AVLNode r){
int rank=1;
while(r != null){
if(data<r.data)
r = r.left;
else if(data > r.data){
rank += 1+ countNodes(r.left);
r = r.right;
}
else{
r.rank=rank+countNodes(r.left);
return r.rank;
}
}
return 0;
}
[N.B] If you want to start your rank from 0 then initialize variable rank=0.
You definitely need to have implemented the method countNodes() to execute this code.

Efficient algorithm to remove any map that is contained in another map from a collection of maps

I have a set (s) of unique maps (Java HashMaps currently) and wish to remove from it any map that is completely contained by some other map in the set (i.e. remove m from s if m.entrySet() is a subset of n.entrySet() for some other n in s.)
I have an n^2 algorithm, but it's too slow. Is there a more efficient way to do this?
Edit:
the set of possible keys is small, if that helps.
Here is an inefficient reference implementation:
public void removeSubmaps(Set<Map> s) {
Set<Map> toRemove = new HashSet<Map>();
for (Map a: s) {
for (Map b : s) {
if (a != b && a.entrySet().containsAll(b.entrySet()))
toRemove.add(b);
}
}
s.removeAll(toRemove);
}
Not sure I can make this anything other than an n^2 algorithm, but I have a shortcut that might make it faster. Make a list of your maps with the length of each map and sort it. A proper subset of a map must be shorter than or equal to the map you're comparing - there's never any need to compare to a map higher on the list.
Here's another stab at it.
Decompose all your maps into a list of key,value,map number. Sort the list by key and value. Go through the list, and for each group of key/value matches, create a permutation of all the map number pairs - these are all potential subsets. When you have the final list of pairs, sort by map numbers. Go through this second list, and count the number of occurrences of each pair - if the number matches the size of one of the maps, you've found a subset.
Edit: My original interpretation of the problem was incorrect, here is new answer based on my re-read of the question.
You can create a custom hash function for HashMap which returns the product of all hash values of its entries. Sort the list of hash values, start the loop from the biggest value, and find all divisors among the smaller hash values; these are possible subsets of this hashmap. Use set.containsAll() to confirm before marking them for removal.
This effectively transforms the problem into a mathematical problem of finding possible divisor from a collection. And you can apply all the common divisor-search optimizations.
Complexity is O(n^2), but if many hashmaps are subsets of others, the actual time spent can be a lot better, approaching O(n) in the best-case scenario (if all hashmaps are subsets of one). But even in the worst-case scenario, the division calculation would be a lot faster than set.containsAll(), which itself is O(n^2) where n is the number of items in a hashmap.
You might also want to create a simple hash function for hashmap entry objects to return smaller numbers to increase multiply/division performance.
Here's a subquadratic (O(N**2 / log N)) algorithm for finding maximal sets from a set of sets: An Old Sub-Quadratic Algorithm for Finding Extremal Sets.
But if you know your data distribution, you can do much better in average case.
This is what I ended up doing. It works well in my situation as there is usually some value that is only shared by a small number of maps. Kudos to Mark Ransom for pushing me in this direction.
In prose: Index the maps by key/value pair, so that each key/value pair is associated with a set of maps. Then, for each map: Find the smallest set associated with one of its key/value pairs; this set is typically small for my data. Each of the maps in this set is a potential 'supermap'; no other map could be a 'supermap' as it would not contain this key/value pair. Search this set for a supermap. Finally remove all the identified submaps from the original set.
private <K, V> void removeSubmaps(Set<Map<K, V>> maps) {
// index the maps by key/value
List<Map<K, V>> mapList = toList(maps);
Map<K, Map<V, List<Integer>>> values = LazyMap.create(HashMap.class, ArrayList.class);
for (int i = 0, uniqueRowsSize = mapList.size(); i < uniqueRowsSize; i++) {
Map<K, V> row = mapList.get(i);
Integer idx = i;
for (Map.Entry<K, V> entry : row.entrySet())
values.get(entry.getKey()).get(entry.getValue()).add(idx);
}
// find submaps
Set<Map<K, V>> toRemove = Sets.newHashSet();
for (Map<K, V> submap : mapList) {
// find the smallest set of maps with a matching key/value
List<Integer> smallestList = null;
for (Map.Entry<K, V> entry : submap.entrySet()) {
List<Integer> list = values.get(entry.getKey()).get(entry.getValue());
if (smallestList == null || list.size() < smallestList.size())
smallestList = list;
}
// compare with each of the maps in that set
for (int i : smallestList) {
Map<K, V> map = mapList.get(i);
if (isSubmap(submap, map))
toRemove.add(submap);
}
}
maps.removeAll(toRemove);
}
private <K,V> boolean isSubmap(Map<K, V> submap, Map<K,V> map){
if (submap.size() >= map.size())
return false;
for (Map.Entry<K,V> entry : submap.entrySet()) {
V other = map.get(entry.getKey());
if (other == null)
return false;
if (!other.equals(entry.getValue()))
return false;
}
return true;
}

Algorithm to tell if two arrays have identical members

What's the best algorithm for comparing two arrays to see if they have the same members?
Assume there are no duplicates, the members can be in any order, and that neither is sorted.
compare(
[a, b, c, d],
[b, a, d, c]
) ==> true
compare(
[a, b, e],
[a, b, c]
) ==> false
compare(
[a, b, c],
[a, b]
) ==> false
Obvious answers would be:
Sort both lists, then check each element to see if they're identical
Add the items from one array to a hashtable, then iterate through the other array, checking that each item is in the hash
nickf's iterative search algorithm
Which one you'd use would depend on whether you can sort the lists first, and whether you have a good hash algorithm handy.
You could load one into a hash table, keeping track of how many elements it has. Then, loop over the second one checking to see if every one of its elements is in the hash table, and counting how many elements it has. If every element in the second array is in the hash table, and the two lengths match, they are the same, otherwise they are not. This should be O(N).
To make this work in the presence of duplicates, track how many of each element has been seen. Increment while looping over the first array, and decrement while looping over the second array. During the loop over the second array, if you can't find something in the hash table, or if the counter is already at zero, they are unequal. Also compare total counts.
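A minimal sketch of that counting approach in Java, assuming the element type has sensible equals/hashCode:
import java.util.HashMap;
import java.util.Map;

// Sketch: multiset comparison with one hash map of counts, O(n) expected time.
class ArrayCompare {
    static <T> boolean sameMembers(T[] a, T[] b) {
        if (a.length != b.length) return false;
        Map<T, Integer> counts = new HashMap<>();
        for (T x : a) counts.merge(x, 1, Integer::sum);   // count occurrences in a
        for (T x : b) {
            Integer c = counts.get(x);
            if (c == null || c == 0) return false;        // missing from a, or b has more of it
            counts.put(x, c - 1);
        }
        return true; // lengths are equal, so every count is back to zero
    }
}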
Another method that would work in the presence of duplicates is to sort both arrays and do a linear compare. This should be O(N*log(N)).
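And a sketch of the sort-and-compare variant for a Comparable element type, working on copies so the inputs are left untouched:
import java.util.Arrays;

// Sketch: sort copies of both arrays, then compare element by element, O(n log n).
class ArrayCompareSorted {
    static <T extends Comparable<T>> boolean sameMembers(T[] a, T[] b) {
        if (a.length != b.length) return false;
        T[] sa = a.clone();
        T[] sb = b.clone();
        Arrays.sort(sa);
        Arrays.sort(sb);
        return Arrays.equals(sa, sb);
    }
}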
Assuming you don't want to disturb the original arrays and space is a consideration, another O(n.log(n)) solution that uses less space than sorting both arrays is:
Return FALSE if arrays differ in size
Sort the first array -- O(n.log(n)) time, extra space required is the size of one array
For each element in the 2nd array, check if it's in the sorted copy of the first array using a binary search -- O(n.log(n)) time
If you use this approach, please use a library routine to do the binary search. Binary search is surprisingly error-prone to hand-code.
[Added after reviewing solutions suggesting dictionary/set/hash lookups:]
In practice I'd use a hash. Several people have asserted O(1) behaviour for hashes, leading them to conclude a hash-based solution is O(N). Typical inserts/lookups may be close to O(1), and some hashing schemes guarantee worst-case O(1) lookup, but worst-case insertion -- in constructing the hash -- isn't O(1). Given any particular hashing data structure, there would be some set of inputs which would produce pathological behaviour. I suspect there exist hashing data structures with a combined worst case for [insert N elements then look up N elements] of O(N.log(N)) time and O(N) space.
You can use a signature (a commutative operation over the array members) to further optimize this in the case where the arrays are usually different, saving the O(n log n) sort or the memory allocation.
A signature can be of the form of a bloom filter(s), or even a simple commutative operation like addition or xor.
A simple example (assuming a long as the signature size and GetHashCode as a good object identifier; if the objects are, say, ints, then their value is a better identifier; and some signatures will be larger than long):
public bool MatchArrays(object[] array1, object[] array2)
{
if (array1.Length != array2.Length)
return false;
long signature1 = 0;
long signature2 = 0;
for (int i = 0; i < array1.Length; i++) {
signature1 = CommutativeOperation(signature1, array1[i].GetHashCode());
signature2 = CommutativeOperation(signature2, array2[i].GetHashCode());
}
if (signature1 != signature2)
return false;
return MatchArraysTheLongWay(array1, array2);
}
where (using an addition operation; use a different commutative operation if desired, e.g. bloom filters)
public long CommutativeOperation(long oldValue, long newElement) {
return oldValue + newElement;
}
This can be done in different ways:
1 - Brute force: for each element in array1, check that the element exists in array2. Note this would require noting the position/index so that duplicates can be handled properly. This requires O(n^2) with much more complicated code; don't even think of it at all...
2 - Sort both lists, then check each element to see if they're identical. O(n log n) for sorting and O(n) to check so basically O(n log n), sort can be done in-place if messing up the arrays is not a problem, if not you need to have 2n size memory to copy the sorted list.
3 - Add the items and their counts from one array to a hashtable, then iterate through the other array, checking that each item is in the hashtable; if so, decrement its count if it is not zero, otherwise remove it from the hashtable. O(n) to create the hashtable, and O(n) to check the other array's items against the hashtable, so O(n). This introduces a hashtable with memory for at most n elements.
4 - Best of the Best (among the above): Subtract or take the difference of the elements at the same index of the two arrays and finally sum up the subtracted values. For example, A1={1,2,3}, A2={3,1,2}: the Diff={-2,1,1}; now sum up the Diff = 0, which means they have the same set of integers. This approach requires O(n) with no extra memory. The C# code would look as follows:
public static bool ArrayEqual(int[] list1, int[] list2)
{
if (list1 == null || list2 == null)
{
throw new Exception("Invalid input");
}
if (list1.Length != list2.Length)
{
return false;
}
int diff = 0;
for (int i = 0; i < list1.Length; i++)
{
diff += list1[i] - list2[i];
}
return (diff == 0);
}
4 doesn't work at all, it is the worst
If the elements of an array are given as distinct, then XOR ( bitwise XOR ) all the elements of both the arrays, if the answer is zero, then both the arrays have the same set of numbers. The time complexity is O(n)
I would suggest sorting both arrays first. Then compare the first element of each array, then the second, and so on.
If you find a mismatch you can stop.
If you sort both arrays first, you'd get O(N log(N)).
What is the "best" solution obviously depends on what constraints you have. If it's a small data set, the sorting, hashing, or brute force comparison (like nickf posted) will all be pretty similar. Because you know that you're dealing with integer values, you can get O(n) sort times (e.g. radix sort), and the hash table will also use O(n) time. As always, there are drawbacks to each approach: sorting will either require you to duplicate the data or destructively sort your array (losing the current ordering) if you want to save space. A hash table will obviously have memory overhead to for creating the hash table. If you use nickf's method, you can do it with little-to-no memory overhead, but you have to deal with the O(n2) runtime. You can choose which is best for your purposes.
Going into deep waters here, but:
Sorted lists
Sorting can be O(n log n) as pointed out. Just to clarify, it doesn't matter that there are two lists, because O(2*n log n) == O(n log n). Comparing each element is another O(n), so sorting both and then comparing each element is O(n) + O(n log n), which is O(n log n).
Hash-tables:
Converting the first list to a hash table is O(n) for reading plus the cost of storing in the hash table, which I guess can be estimated as O(n), giving O(n). Then you'll have to check the existence of each element of the other list in the produced hash table, which is (at least?) O(n) (assuming that checking the existence of an element in the hash table is constant). All in all, we end up with O(n) for the check.
The Java List interface defines equals as each corresponding element being equal.
Interestingly, the Java Collection interface definition almost discourages implementing the equals() function.
Finally, the Java Set interface, per its documentation, implements this very behaviour. The implementation should be very efficient, but the documentation makes no mention of performance. (Couldn't find a link to the source; it's probably too strictly licensed. Download and look at it yourself. It comes with the JDK.) Looking at the source, HashSet (which is a commonly used implementation of Set) delegates the equals() implementation to AbstractSet, which uses the containsAll() function of AbstractCollection, using the contains() function again from HashSet. So HashSet.equals() runs in O(n) as expected (looping through all elements and looking them up in constant time in the hash table).
Please edit if you know better to spare me the embarrasment.
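For the no-duplicates case the question describes, that Set.equals contract boils the whole comparison down to something like this sketch:
import java.util.Arrays;
import java.util.HashSet;

// Sketch: set equality via java.util.Set.equals.
// Only valid because the question guarantees there are no duplicates.
class SetCompare {
    static <T> boolean sameMembers(T[] a, T[] b) {
        return new HashSet<>(Arrays.asList(a)).equals(new HashSet<>(Arrays.asList(b)));
    }
}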
Pseudocode :
A:array
B:array
C:hashtable
if A.length != B.length then return false;
foreach objA in A
{
H = objA;
if H is not found in C.Keys then
C.add(H as key,1 as initial value);
else
C.Val[H as key]++;
}
foreach objB in B
{
H = objB;
if H is not found in C.Keys then
return false;
else
C.Val[H as key]--;
}
if(C contains non-zero value)
return false;
else
return true;
The best way is probably to use hashmaps. Since insertion into a hashmap is O(1), building a hashmap from one array should take O(n). You then have n lookups, which each take O(1), so another O(n) operation. All in all, it's O(n).
In python:
def comparray(a, b):
sa = set(a)
return len(sa)==len(b) and all(el in sa for el in b)
Ignoring the built-in ways to do this in C#, you could do something like this:
It's O(1) in the best case, O(N) (per list) in the worst case.
public bool MatchArrays(object[] array1, object[] array2)
{
if (array1.Length != array2.Length)
return false;
bool retValue = true;
HashTable ht = new HashTable();
for (int i = 0; i < array1.Length; i++)
{
ht.Add(array1[i]);
}
for (int i = 0; i < array2.Length; i++)
{
if (!ht.Contains(array2[i]))
{
retValue = false;
break;
}
}
return retValue;
}
Upon collisions a hashmap is O(n) in most cases, because it uses a linked list to store the collisions. However, there are better approaches and you should hardly have collisions anyway, because if you did the hashmap would be useless. In all regular cases it's simply O(1). Besides that, it's not likely to have more than a small n of collisions in a single hashmap, so performance wouldn't suck that badly; you can safely say that it's O(1) or almost O(1) because the n is so small it can be ignored.
Here is another option, let me know what you guys think. It should be T(n) = 2n*log2(n) -> O(n log n) in the worst case.
private boolean compare(List listA, List listB){
    if (listA.isEmpty() && listB.isEmpty()) return true;
    List<MatchingItem> runner = new ArrayList<MatchingItem>();
    List maxList = listA.size() > listB.size() ? listA : listB;
    List minList = listA.size() > listB.size() ? listB : listA;
    int matches = 0;
    List nextList = null;
    int maxLength = maxList.size();
    for (int i = 0; i < maxLength; i++){
        for (int j = 0; j < 2; j++) {
            // alternate between the two lists
            nextList = (nextList == null) ? maxList : (maxList == nextList) ? minList : maxList;
            if (i < nextList.size()) {
                MatchingItem nextItem = new MatchingItem(nextList.get(i), nextList);
                int position = runner.indexOf(nextItem);
                if (position < 0){
                    runner.add(nextItem);
                } else {
                    MatchingItem itemInBag = runner.get(position);
                    if (itemInBag.getList() != nextList) matches++;
                    runner.remove(position);
                }
            }
        }
    }
    return maxLength == matches;
}
public class MatchingItem {
    private Object item;
    private List itemList;
    public MatchingItem(Object item, List itemList){
        this.item = item;
        this.itemList = itemList;
    }
    public boolean equals(Object other){
        MatchingItem otherItem = (MatchingItem) other;
        return otherItem.item.equals(this.item) && otherItem.itemList != this.itemList;
    }
    public Object getItem(){ return this.item; }
    public List getList(){ return this.itemList; }
}
The best I can think of is O(n^2), I guess.
function compare($foo, $bar) {
if (count($foo) != count($bar)) return false;
foreach ($foo as $f) {
foreach ($bar as $b) {
if ($f == $b) {
// $f exists in $bar, skip to the next $foo
continue 2;
}
}
return false;
}
return true;
}

Resources