Data structure supporting O(1) remove/insert/findOldest? - algorithm

This question was asked in the interview:
Propose and implement a data structure that works with integer data drawn from a finite, contiguous range of integers. The data structure should support O(1) insert and remove operations, as well as findOldest (returning the oldest value inserted into the data structure).
No duplicates are allowed (i.e. if a value is already inside, it should not be added again).
Also, if needed, an init operation may be used for initialization.
I proposed using an array (sized to the range) of 1/0 flags indicating whether each value is inside. That handles insert/remove and requires O(range size) initialization.
But I have no idea how to implement findOldest with the given constraints.
Any ideas?
P.S. No dynamic allocation is allowed.

I apologize if I've misinterpreted your question, but the sense I get is that
You have a fixed range of values you're considering (say, [0, N))
You need to support insertions and deletions without duplicates.
You need to support findOldest.
One option would be to build an array of length N, where each entry stores a boolean "is active" flag along with a doubly-linked-list cell (pointers to the previous and next entries in insertion order). Intuitively, you're building a bitvector with a linked list threaded through it that stores the insertion order.
Initially, all bits are set to false and the pointers are all NULL. When you do an insertion, set the bit on the appropriate cell to true (returning immediately if it's already set), then update the doubly-linked list of elements by appending this new cell to it. This takes time O(1). To do a findOldest step, just query the pointer to the oldest element. Finally, to do a removal step, clear the bit on the element in question and remove it from the doubly-linked list, updating the head and tail pointer if necessary.
All in all, all operations take time O(1) and no dynamic allocations are performed because the linked list cells are preallocated as part of the array.
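For concreteness, here is a minimal C++ sketch of that layout, assuming the value range is [0, N); the class and field names are mine, not part of the original answer:

constexpr int N = 1000;          // size of the fixed value range [0, N)

struct Cell {
    bool active = false;         // "is the value present" bit
    Cell* prev = nullptr;        // previous cell in insertion order
    Cell* next = nullptr;        // next cell in insertion order
};

class OrderedSet {
    Cell cells[N];               // preallocated, so no dynamic allocation
    Cell* oldest = nullptr;      // head of the insertion-order list
    Cell* newest = nullptr;      // tail of the insertion-order list

public:
    void insert(int v) {
        Cell& c = cells[v];
        if (c.active) return;    // duplicate: nothing to do
        c.active = true;
        c.prev = newest;
        c.next = nullptr;
        if (newest) newest->next = &c; else oldest = &c;
        newest = &c;
    }

    void remove(int v) {
        Cell& c = cells[v];
        if (!c.active) return;
        c.active = false;
        if (c.prev) c.prev->next = c.next; else oldest = c.next;
        if (c.next) c.next->prev = c.prev; else newest = c.prev;
    }

    int findOldest() const {     // returns -1 when the set is empty
        return oldest ? int(oldest - cells) : -1;
    }
};

Everything lives inside the fixed-size array, so all three operations are O(1) and nothing is allocated after construction.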
Hope this helps!

Related

Circular Linked list and the iterator

I have a question in my algorithm class in data structures.
For which of the following representations can all basic queue operations be performed in constant worst-case time?
To perform constant worst case time for the circular linked list, where should I have to keep the iterator?
They have given two choices:
Maintain an iterator that corresponds to the first item in the list
Maintain an iterator that corresponds to the last item in the list.
My answer is that to get constant worst-case time we should maintain the iterator that corresponds to the last item in the list, but I don't know how to justify and explain it. What are the important points needed to justify this answer?
For which of the following representations can all basic queue operations be performed in constant worst-case time?
My answer is that to get constant worst-case time we should maintain the iterator that corresponds to the last item
Assuming that your circular list is singly-linked, and that "the last item" in the circular list is the one that has been inserted the latest, your answer is correct *. In order to prove that you are right, you need to demonstrate how to perform these four operations in constant time:
Get the front element - Since the queue is circular and you have an iterator pointing to the latest inserted element, the next element from the latest inserted is the front element (i.e. the earliest inserted).
Get the back element - Since you maintain an iterator pointing to the latest inserted element, getting the back of the queue is a matter of dereferencing the iterator.
Enqueue - This is a matter of inserting after the iterator that you hold, and moving the iterator to the newly inserted item.
Dequeue - Copy the content of the front element (described in #1) into a temporary variable, re-point the next link of the latest inserted element to that of the front element, and delete the front element.
Since none of these operations require iterating the list, all of them can be performed in constant time.
* With doubly-linked circular lists both answers would be correct.
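A small C++ sketch of the four operations, assuming a singly-linked circular list where tail points to the latest inserted node (the names are mine, and front/dequeue assume the queue is non-empty):

struct QNode {
    int value;
    QNode* next;
};

struct CircularQueue {
    QNode* tail = nullptr;                 // latest inserted node; tail->next is the front

    int front() const { return tail->next->value; }   // earliest inserted
    int back()  const { return tail->value; }         // latest inserted

    void enqueue(int v) {                  // insert after tail, then advance tail
        QNode* n = new QNode{v, nullptr};
        if (!tail) n->next = n;            // a single node points to itself
        else { n->next = tail->next; tail->next = n; }
        tail = n;
    }

    int dequeue() {                        // remove the node right after tail
        QNode* head = tail->next;
        int v = head->value;
        if (head == tail) tail = nullptr;  // queue becomes empty
        else tail->next = head->next;
        delete head;
        return v;
    }
};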

LinkedList does not provide index based access, so why does it have get(index) method?

I understand that ArrayList is an index-based data structure that allows you to access its elements by index, but LinkedList is not supposed to be index based, so why does it have a get(index) method that allows direct access to an element?
It may not be efficient to retrieve items from a linked list by index, but linked lists do have indices, and sometimes you just need to retrieve an item at a certain index. When that happens, it's much better to have a get method than to force users to grab an iterator and iterate to the desired position. As long as you don't call it too much or the list is small, it's fine.
This is really just an implementation decision. While an array would probably be a fairly useless data structure if you can't look up elements by index, adding a by-index lookup to a linked-list implementation doesn't do any harm (well, unless users assume it's fast - see below), and it does come in handy sometimes.
One can assign every element a number as follows:
Index:    0           1           2           3           4
Head = Element0 -> Element1 -> Element2 -> Element3 -> Element4 -> NULL
From here, it's trivial to write a function to return the element at some given index.
Note that a by-index lookup on a linked list will be slow: if you're looking for, say, the element in the middle, you'll need to walk through half the list to get there.
The previous answers imply that LinkedLists have indices.
However, a fixed index for every element in the data structure would defeat the purpose of the LinkedList and e.g. make some remove/add operations slower, because the structure would need to be reindexed every time. This would take linear time, even for elements at the beginning and end of the list, which are crucial for Java's LinkedList's efficiency.
From Java's LinkedList implementation you can see that there is no constant time index access to the element, but rather a linear traversal where the exact element is figured out on the go.
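As a plain sketch of that traversal (not Java's actual implementation, which also chooses whether to walk from the head or the tail), a get(index) on a singly linked list just follows index links from the head:

struct ListNode {
    int value;
    ListNode* next;
};

int get(const ListNode* head, int index) {
    const ListNode* cur = head;
    for (int i = 0; i < index; ++i)   // O(index): one link followed per step
        cur = cur->next;
    return cur->value;
}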

Data design Issue to find insertion deletion and getMin in O(1)

As said in the title, I need to define a data structure that takes only O(1) time for insertion, deletion and getMin. There are no space constraints.
I have searched SO for the same, and all I found covers insertion and deletion in O(1) time (even a stack does that). The previous posts on Stack Overflow only mention hashing.
From my analysis, for getMin in O(1) time we can use a heap data structure,
and for insertion and deletion in O(1) time we have a stack,
so in order to achieve my goal I think I need to tweak and combine the heap data structure and the stack.
How would I add a hashing technique to this situation?
If I use a hashtable, what should my hash function look like, and how do I analyze the situation in terms of hashing? Any good references will be appreciated.
If you go with your initial assumption that insertion and deletion are O(1) complexity (if you only want to insert onto the top and delete/pop from the top, then a stack works fine), then in order to have getMin return the minimum value in constant time you would need to store the min somehow. If you just had a member variable keeping track of the min, what would happen when it was deleted off the stack? You would need the next minimum, or the minimum relative to what's left in the stack. To do this you could have each element in the stack carry what it believes to be the minimum. The stack is represented in code by a linked list, so the struct of a node in the linked list would look something like this:
struct Node
{
    int value;   // the value stored at this position in the stack
    int min;     // the minimum of this value and everything below it
    Node *next;
};
If you look at an example list: 7->3->1->5->2. Let's look at how this would be built. First you push in the value 2 (to an empty stack), this is the min because it's the first number, keep track of it and add it to the node when you construct it: {2, 2}. Then you push the 5 onto the stack, 5>2 so the min is the same push {5,2}, now you have {5,2}->{2,2}. Then you push 1 in, 1<2 so the new min is 1, push {1, 1}, now it's {1,1}->{5,2}->{2,2} etc. By the end you have:
{7,1}->{3,1}->{1,1}->{5,2}->{2,2}
In this implementation, if you popped off 7, 3, and 1 your new min would be 2, as it should be. And all of your operations are still in constant time because you just added a comparison and another value to the node. (You could use something like a peek() operation, or just use a pointer to the head of the list, to take a look at the top of the stack and grab the min there; it'll give you the min of the stack in constant time.)
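Here is a compact C++ sketch of that idea, using a fixed-size array instead of an explicit linked list (the capacity and names are mine):

#include <algorithm>

class MinStack {
    struct Entry { int value; int min; };
    Entry data[1024];                 // fixed capacity, enough for the sketch
    int count = 0;

public:
    void push(int v) {
        int m = (count == 0) ? v : std::min(v, data[count - 1].min);
        data[count++] = {v, m};       // each entry remembers the min beneath it
    }
    void pop()          { --count; }
    int  top()    const { return data[count - 1].value; }
    int  getMin() const { return data[count - 1].min; }
};

Pushing the example sequence 2, 5, 1, 3, 7 reproduces the {value, min} pairs shown above.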
A tradeoff in this implementation is that you'd have an extra integer in your nodes, and if you only have one or two mins in a very large list it is a waste of memory. If this is the case then you could keep track of the mins in a separate stack and just compare the value of the node that you're deleting to the top of this list and remove it from both lists if it matches. It's more things to keep track of so it really depends on the situation.
DISCLAIMER: This is my first post in this forum so I'm sorry if it's a bit convoluted or wordy. I'm also not saying that this is "one true answer" but it is the one that I think is the simplest and conforms to the requirements of the question. There are always tradeoffs and depending on the situation different approaches are required.
This is a design problem, which means they want to see how quickly you can augment existing data-structures.
start with what you know:
O(1) update, i.e. insertion/deletion, is screaming hashtable
O(1) getMin is screaming hashtable too, but this time ordered.
Here, I am presenting one way of doing it. You may find something else that you prefer.
create a HashMap, call it main, where to store all the elements
create a LinkedHashMap (java has one), call it mins where to track the minimum values.
the first time you insert an element into main, add it to mins as well.
for every subsequent insert, if the new value is less than what's at the head of your mins map, add it to the map with something equivalent to addToHead.
when you remove an element from main, also remove it from mins. 2*O(1) = O(1)
Notice that getMin is simply peeking without deleting. So just peek at the head of mins.
EDIT:
Amortized algorithm:
(thanks to #Andrew Tomazos - Fathomling, let's have some more fun!)
We all know that the cost of insertion into a hashtable is O(1). But in fact, if you have ever built a hash table you know that you must keep doubling the size of the table to avoid overflow. Each time you double the size of a table with n elements, you must re-insert the elements and then add the new element. By this analysis it would seem that the worst-case cost of adding an element to a hashtable is O(n). So why do we say it's O(1)? Because not all insertions hit the worst case! Indeed, only the insertions where doubling occurs take the worst case. Therefore, inserting n elements takes n + sum(2^i for i = 0 to lg(n) - 1), which gives n + 2n = O(n), so that O(n)/n = O(1)!
Why not apply the same principle to the LinkedHashMap? You have to reload all the elements anyway! So, each time you are doubling main, put all the elements of main into mins as well, and sort them in mins. Then for all other cases proceed as above (the bulleted steps).
A hashtable gives you insertion and deletion in O(1) (a stack does not, because you can't have holes in a stack). But you can't have getMin in O(1) too, because ordering your elements can't be done faster than O(n*log(n)) (this is the comparison-sorting lower bound), which works out to O(log(n)) per element.
You can keep a pointer to the min to have getMin in O(1). This pointer can be updated easily for an insertion but not for the deletion of the min. But depending on how often you use deletion it can be a good idea.
You can use a trie. A trie has O(L) complexity for insertion, deletion, and getMin, where L is the length of the string (or whatever) you're looking for. It is of constant complexity with respect to n (the number of elements).
It requires a huge amount of memory, though. As they emphasized "no space constraints", they were probably thinking of a trie. :D
Strictly speaking, your problem as stated is provably impossible; however, consider the following:
Given a type T, place an enumeration on all possible elements of that type such that value i is less than value j iff T(i) < T(j) (i.e. number all possible values of type T in order).
Create an array of that size.
Make the elements of the array:
struct PT
{
    T t;               // the stored value
    PT* next_higher;   // next occupied element in sorted order
    PT* prev_lower;    // previous occupied element in sorted order
};
Insert and delete elements in the array, maintaining the doubly linked list (in index order, and hence sorted order) through the occupied slots.
This will give you constant getMin and delete.
For insertion you need to find the next element in the array in constant time, so I would use a type of radix search.
If the size of the array is 2^x then maintain x "skip" arrays where element j of array i points to the nearest element of the main array to index (j << i).
This will then always require a fixed x number of lookups to update and search so this will give constant time insertion.
This uses exponential space, but this is allowed by the requirements of the question.
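A simplified C++ sketch of this layout, with a small universe [0, U): getMin and remove are O(1), while insert below finds its neighbour with a plain backward scan, which is exactly the step the "skip" arrays above are meant to replace to make insertion constant time (all names are mine):

constexpr int U = 256;              // size of the enumerated value universe

struct Slot {
    bool present = false;
    int next_higher = -1;           // index of the next present value, or -1
    int prev_lower  = -1;           // index of the previous present value, or -1
};

Slot slots[U];
int lowest = -1;                    // index of the current minimum, or -1

int getMin() { return lowest; }     // O(1)

void removeValue(int v) {           // O(1): unlink the slot from the list
    if (!slots[v].present) return;
    slots[v].present = false;
    int p = slots[v].prev_lower, n = slots[v].next_higher;
    if (p != -1) slots[p].next_higher = n; else lowest = n;
    if (n != -1) slots[n].prev_lower = p;
}

void insertValue(int v) {
    if (slots[v].present) return;
    int p = v - 1;
    while (p >= 0 && !slots[p].present) --p;   // naive predecessor search
    int n = (p == -1) ? lowest : slots[p].next_higher;
    slots[v] = {true, n, p};
    if (p != -1) slots[p].next_higher = v; else lowest = v;
    if (n != -1) slots[n].prev_lower = v;
}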
In your problem statement you say "insertion and deletion in O(1) time we have stack...",
so I am assuming deletion = pop().
In that case, use another stack to track the min.
algo:
Stack 1 -- normal stack; Stack 2 -- min stack
Insertion
push to stack 1.
if stack 2 is empty or the new item <= stack2.peek(), push it to stack 2 as well (using <= so that duplicates of the minimum are tracked)
objective: at any point of time stack2.peek() should give you min O(1)
Deletion
pop() from stack 1.
if popped element equals stack2.peek(), pop() from stack 2 as well
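A minimal C++ version of these steps (the container choice and names are mine); the min stack is pushed on <= so that duplicates of the minimum survive a pop:

#include <stack>

class TwoStackMin {
    std::stack<int> values;   // Stack 1 -- normal stack
    std::stack<int> mins;     // Stack 2 -- min stack

public:
    void push(int v) {
        values.push(v);
        if (mins.empty() || v <= mins.top()) mins.push(v);
    }
    void pop() {
        if (values.top() == mins.top()) mins.pop();
        values.pop();
    }
    int top()    const { return values.top(); }
    int getMin() const { return mins.top(); }
};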

return inserted items for a given interval

How would one design a memory-efficient system which accepts items added into it and allows items to be retrieved for a given time interval (i.e. return items inserted between time T1 and time T2)? There is no DB involved; items are stored in memory. What data structure and algorithm would be involved?
Updated:
Assume an extremely high insertion rate compared to the query rate.
You can use a sorted data structure, where the key is the time of arrival. Note the following:
items are not removed;
items are inserted in order [if item i was inserted after item j then key(i) > key(j)].
For this reason, a tree is discouraged, since it is overpowered here and insertion into it is O(log n), whereas you can get O(1) insertion. I suggest using one of the following:
(1) Array: the array is always filled up at its end. When the allocated array is full, reallocate a bigger [double-sized] array and copy the existing array into it.
Advantages: good caching is usually expected with arrays, O(1) amortized insertion, used space is at most 2*elementSize*#elements.
Disadvantages: high latency: when the array is full, it will take O(n) to add an element, so you need to expect that once in a while there will be a costly operation.
(2) Skip list: the skip list also gives you O(log n) seek and O(1) insertion at the end, but it doesn't have the latency issue. However, it will suffer more cache misses than an array. Space used is on average elementSize*#elements + 2*pointerSize*#elements for a skip list.
Advantages: O(1) insertion, no costly ops.
Disadvantages: bad caching is expected.
Suggestion:
I suggest using an array if latency is not an issue. If it is, a skip list is the better choice.
In both, finding the desired interval is:
findInterval(T1, T2):
    start <- data.find(T1)
    end   <- data.find(T2)
    for each element in data from start to end:
        yield element
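One way to realize the array variant in C++: records arrive in timestamp order, so the array stays sorted and the interval lookup is two binary searches (the record layout and names are illustrative):

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

struct Record {
    std::int64_t timestamp;
    int payload;
};

std::vector<Record> data;    // appended in arrival order, hence sorted by time

void add(std::int64_t t, int payload) { data.push_back({t, payload}); }

// Returns iterators [first, last) covering all records with t1 <= time <= t2.
auto findInterval(std::int64_t t1, std::int64_t t2) {
    auto first = std::lower_bound(data.begin(), data.end(), t1,
        [](const Record& r, std::int64_t t) { return r.timestamp < t; });
    auto last = std::upper_bound(data.begin(), data.end(), t2,
        [](std::int64_t t, const Record& r) { return t < r.timestamp; });
    return std::make_pair(first, last);
}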
Either BTree or Binary Search Tree could be a good in-memory data structure to accomplish the above. Just save the timestamp in each node and you can do a range query.
You can add them all to a simple array and sort them.
Do a binary search to locate both T1 and T2. All the array elements between them are what you are looking for.
This is helpful if the searching is done only after all the elements are added. If not, you can use an AVL or red-black tree.
How about a relation interval tree (encode your items as intervals containing only a single element, e.g., [a,a])? Although, it has been said already that the ratio of the anticipated operations matter (a lot actually). But here's my two cents:
I suppose an item X that is inserted at time t(X) is associated with that timestamp, right? Meaning you don't insert an item now which has a timestamp from a week ago or something. If that's the case go for the simple array and do interpolation search or something similar (your items will already be sorted according to the attribute that your query refers to, i.e., the time t(X)).
We already have an answer that suggests trees, but I think we need to be more specific: the only situation in which this is really a good solution is if you are very specific about how you build up the tree (and then I would say it's on par with the skip lists suggested in a different answer). The objective is to keep the tree as full as possible to the left - I'll make clearer what that means in the following. Make sure each node has a pointer to its (up to) two children and to its parent, and knows the depth of the subtree rooted at that node.
Keep a pointer to the root node so that you are able to do lookups in O(log(n)), and keep a pointer to the last inserted node N (which is necessarily the node with the highest key - its timestamp will be the highest). When you are inserting a node, check how many children N has:
If 0, then replace N with the new node you are inserting and make N its left child. (At this point you'll need to update the tree depth field of at most O(log(n)) nodes.)
If 1, then add the new node as its right child.
If 2, then things get interesting. Go up the tree from N until either you find a node that has only 1 child, or the root. If you find a node with only 1 child (this is necessarily the left child), then add the new node as its new right child. If all nodes up to the root have two children, then the current tree is full. Add the new node as the new root node and the old root node as its left child. Don't change the old tree structure otherwise.
Addendum: in order to make cache behaviour and memory overhead better, the best solution is probably to make a tree or skip list of arrays. Instead of every node having a single time stamp and a single value, make every node have an array of, say, 1024 time stamps and values. When an array fills up you add a new one in the top level data structure, but in most steps you just add a single element to the end of the "current array". This wouldn't affect big-O behaviour with respect to either memory or time, but it would reduce the overhead by a factor of 1024, while latency is still very small.

Hashtable with doubly linked lists?

Introduction to Algorithms (CLRS) states that a hash table using doubly linked lists is able to delete items more quickly than one with singly linked lists. Can anybody tell me what is the advantage of using doubly linked lists instead of single linked list for deletion in Hashtable implementation?
The confusion here is due to the notation in CLRS. To be consistent with the question, I use the CLRS notation in this answer.
We use the hash table to store key-value pairs. The value portion is not mentioned in the CLRS pseudocode, while the key portion is defined as k.
In my copy of CLR (I am working off of the first edition here), the routines listed for hashes with chaining are insert, search, and delete (with more verbose names in the book). The insert and delete routines take argument x, which is the linked list element associated with key key[x]. The search routine takes argument k, which is the key portion of a key-value pair. I believe the confusion is that you have interpreted the delete routine as taking a key, rather than a linked list element.
Since x is a linked list element, having it alone is sufficient to do an O(1) deletion from the linked list in the h(key[x]) slot of the hash table, if it is a doubly-linked list. If, however, it is a singly-linked list, having x is not sufficient. In that case, you need to start at the head of the linked list in slot h(key[x]) of the table and traverse the list until you finally hit x to get its predecessor. Only when you have the predecessor of x can the deletion be done, which is why the book states the singly-linked case leads to the same running times for search and delete.
Additional Discussion
Although CLRS says that you can do the deletion in O(1) time, assuming a doubly-linked list, it also requires you have x when calling delete. The point is this: they defined the search routine to return an element x. That search is not constant time for an arbitrary key k. Once you get x from the search routine, you avoid incurring the cost of another search in the call to delete when using doubly-linked lists.
The pseudocode routines are lower level than you would use if presenting a hash table interface to a user. For instance, a delete routine that takes a key k as an argument is missing. If that delete is exposed to the user, you would probably just stick to singly-linked lists and have a special version of search to find the x associated with k and its predecessor element all at once.
Unfortunately my copy of CLRS is in another country right now, so I can't use it as a reference. However, here's what I think it is saying:
Basically, a doubly linked list supports O(1) deletions because if you know the address of the item, you can just do something like:
x.left.right = x.right;
x.right.left = x.left;
to delete the object from the linked list, whereas in a singly linked list, even if you have the address, you need to search through the list to find its predecessor in order to do:
pred.next = x.next
So, when you delete an item from the hash table, you look it up, which is O(1) due to the properties of hash tables, then delete it in O(1), since you now have the address.
If this was a singly linked list, you would need to find the predecessor of the object you wish to delete, which would take O(n).
However:
I am also slightly confused about this assertion in the case of chained hash tables, because of how lookup works. In a chained hash table, if there is a collision, you already need to walk through the linked list of values in order to find the item you want, and thus would need to also find its predecessor.
But, the way the statement is phrased gives clarification: "If the hash table supports deletion, then its linked lists should be doubly linked so that we can delete an item quickly. If the lists were only singly linked, then to delete element x, we would first have to find x in the list T[h(x.key)] so that we could update the next attribute of x’s predecessor."
This is saying that you already have element x, which means you can delete it in the above manner. If you were using a singly linked list, even if you had element x already, you would still have to find its predecessor in order to delete it.
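To make the contrast concrete, here is a small C++ sketch of removing a node x from one bucket's chain (the names are mine, not CLRS's). With a doubly linked chain, having x alone is enough; with a singly linked chain, the predecessor must be found first:

struct ChainNode {
    int key;
    ChainNode* prev;   // unused in the singly linked version below
    ChainNode* next;
};

// Doubly linked chain: O(1), x carries enough information to splice itself out.
void unlinkDoubly(ChainNode*& bucketHead, ChainNode* x) {
    if (x->prev) x->prev->next = x->next; else bucketHead = x->next;
    if (x->next) x->next->prev = x->prev;
}

// Singly linked chain: O(chain length), the predecessor has to be found first.
void unlinkSingly(ChainNode*& bucketHead, ChainNode* x) {
    if (bucketHead == x) { bucketHead = x->next; return; }
    ChainNode* pred = bucketHead;
    while (pred->next != x) pred = pred->next;   // the extra traversal
    pred->next = x->next;
}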
I can think of one reason, but this isn't a very good one. Suppose we have a hash table of size 100. Now suppose values A and G are each added to the table. Maybe A hashes to slot 75. Now suppose G also hashes to 75, and our collision resolution policy is to jump forward by a constant step size of 80. So we try to jump to (75 + 80) % 100 = 55. Now, instead of starting at the front of the list and traversing forward 85, we could start at the current node and traverse backwards 20, which is faster. When we get to the node that G is at, we can mark it as a tombstone to delete it.
Still, I recommend using arrays when implementing hash tables.
A hashtable is often implemented as a vector of lists, where the index in the vector is the key (hash).
If you don't have more than one value per key and you are not interested in any logic regarding those values, a singly linked list is enough. A more complex/specific design for selecting one of the values may require a doubly linked list.
Let's design the data structures for a caching proxy. We need a map from URLs to content; let's use a hash table. We also need a way to find pages to evict; let's use a FIFO queue to track the order in which URLs were last accessed, so that we can implement LRU eviction. In C, the data structure could look something like
struct node {
    struct node *queueprev, *queuenext;
    struct node **hashbucketprev, *hashbucketnext;
    const char *url;
    const void *content;
    size_t contentlength;
};
struct node *queuehead;    /* circular doubly-linked list */
struct node **hashbucket;
One subtlety: to avoid a special case and wasting space in the hash buckets, x->hashbucketprev points to the pointer that points to x. If x is first in the bucket, it points into hashbucket; otherwise, it points into another node. We can remove x from its bucket with
if (x->hashbucketnext)  /* the last node in a chain has no successor */
    x->hashbucketnext->hashbucketprev = x->hashbucketprev;
*(x->hashbucketprev) = x->hashbucketnext;
When evicting, we iterate over the least recently accessed nodes via the queuehead pointer. Without hashbucketprev, we would need to hash each node and find its predecessor with a linear search, since we did not reach it via hashbucketnext. (Whether that's really bad is debatable, given that the hash should be cheap and the chain should be short. I suspect that the comment you're asking about was basically a throwaway.)
If the items in your hashtable are stored in "intrusive" lists, they can be aware of the linked list they are a member of. Thus, if the intrusive list is also doubly-linked, items can be quickly removed from the table.
(Note, though, that the "intrusiveness" can be seen as a violation of abstraction principles...)
An example: in an object-oriented context, an intrusive list might require all items to be derived from a base class.
class BaseListItem {
    BaseListItem *prev, *next;
    ...
public: // list operations
    void insertAfter(BaseListItem*);
    void insertBefore(BaseListItem*);
    void removeFromList();
};
The performance advantage is that any item can be quickly removed from its doubly-linked list without locating or traversing the rest of the list.
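For illustration, removeFromList() could look like this, assuming the declarations above use a void return type and the item is currently linked into a doubly linked list:

void BaseListItem::removeFromList() {
    if (prev) prev->next = next;    // splice the neighbours together
    if (next) next->prev = prev;
    prev = next = nullptr;          // the item is now detached
}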
