Data structure with insert, delete, and mode - data-structures

Here is the interview problem: design a data structure for a range of integers {1,...,M} (numbers can be repeated) that supports insert(x), delete(x), and mode(), which returns the most frequent number.
The interviewer said that all the operations can be done in O(1) with O(M) preprocessing. He also accepted insert(x) and delete(x) in O(log n) and mode() in O(1), with O(M) preprocessing.
But I can only manage O(n) for insert(x) and delete(x) with mode() in O(1). How can I achieve O(log n), or even O(1), for insert(x) and delete(x), while keeping mode() in O(1) with O(M) preprocessing?

When you hear O(log X) operations, the first structures that come to mind should be a binary search tree and a heap. For reference (since I'm focusing on a heap below):
A heap is a specialized tree-based data structure that satisfies the heap property: If A is a parent node of B then the key of node A is ordered with respect to the key of node B with the same ordering applying across the heap. ... The keys of parent nodes are always greater than or equal to those of the children and the highest key is in the root node (this kind of heap is called max heap) ....
A binary search tree doesn't allow construction (from unsorted data) in O(M), so let's see if we can make a heap work (you can create a heap in O(M)).
Clearly we want the most frequent number at the top, so this heap needs to use frequency as its ordering.
But this brings us to a problem - insert(x) and delete(x) will both require that we look through the entire heap to find the correct element.
Now you should be thinking "what if we had some sort of mapping from index to position in the tree?", and this is exactly what we're going to have. If all / most of the M elements exist, we could simply have an array, with each index i's element being a pointer to the node in the heap. If implemented correctly, this will allow us to look up the heap node in O(1), which we could then modify appropriately, and move, taking O(log M) for both insert and delete.
If only a few of the M elements exist, replacing the array with a (hash-)map (of integer to heap node) might be a good idea.
Returning the mode will take O(1).
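A minimal sketch of this heap-plus-lookup idea in Python (class and method names are my own, not from the answer): a max-heap ordered by frequency, with a dictionary playing the role of the index-to-node mapping, so each value's heap position is found in O(1) and re-heapifying after an insert or delete costs O(log M).

```python
class FrequencyHeap:
    """Max-heap of [value, frequency] pairs, plus a value -> heap-index map."""

    def __init__(self):
        self.heap = []   # list of [value, frequency]
        self.pos = {}    # value -> index in self.heap

    def _swap(self, i, j):
        self.heap[i], self.heap[j] = self.heap[j], self.heap[i]
        self.pos[self.heap[i][0]] = i
        self.pos[self.heap[j][0]] = j

    def _sift_up(self, i):
        while i > 0 and self.heap[i][1] > self.heap[(i - 1) // 2][1]:
            self._swap(i, (i - 1) // 2)
            i = (i - 1) // 2

    def _sift_down(self, i):
        n = len(self.heap)
        while True:
            largest = i
            for c in (2 * i + 1, 2 * i + 2):
                if c < n and self.heap[c][1] > self.heap[largest][1]:
                    largest = c
            if largest == i:
                return
            self._swap(i, largest)
            i = largest

    def insert(self, x):
        if x in self.pos:
            i = self.pos[x]
            self.heap[i][1] += 1
            self._sift_up(i)          # frequency grew: may move toward the root
        else:
            self.heap.append([x, 1])
            self.pos[x] = len(self.heap) - 1
            self._sift_up(len(self.heap) - 1)

    def delete(self, x):
        i = self.pos[x]
        self.heap[i][1] -= 1
        if self.heap[i][1] == 0:
            self._swap(i, len(self.heap) - 1)
            del self.pos[x]
            self.heap.pop()
            if i < len(self.heap):    # restore heap order around the hole
                self._sift_down(i)
                self._sift_up(i)
        else:
            self._sift_down(i)        # frequency shrank: may move toward leaves

    def mode(self):
        return self.heap[0][0] if self.heap else None
```

If most of the M values are expected to be present, `pos` could be a plain array of length M+1 instead of a dictionary, exactly as the answer suggests.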
O(1) for all operations is certainly quite a bit more difficult.
The following structure comes to mind:
 3     2
 ^     ^
 |     |
 5     7     4     1
12    14    15    18
To explain what's going on here - 12, 14, 15 and 18 correspond to the frequency, and the numbers above correspond to the elements with said frequency, so both 5 and 3 would have a frequency of 12, 7 and 2 would have a frequency of 14, etc.
This could be implemented as a doubly linked list:
         /-----------\                 /-----------\
(12) <-> 5 <-> 3 <-> (13) <-> (14) <-> 7 <-> 2 <-> (15) <-> 4 <-> (16) <-> (18) <-> 1
 ^--------------------/^-------/^-------------------/^-------------/^-------/
You may notice that:
I filled in the missing 13 and 16. These are necessary; otherwise we'd have to update all elements with the same frequency when doing an insert (in this example, you would've needed to update 5 to point to 13 when doing insert(3), because 13 wouldn't have existed yet, so 5 would've been pointing to 14).
I skipped 17. This is just an optimization in terms of space usage: it makes the structure take O(M) space, as opposed to O(M + MaxFrequency). The exact condition for skipping a number is that no element has that frequency or one less than that frequency.
There are some strange things going on above the linked list. These simply mean that 5 points to 13 as well, and 7 points to 15 as well, i.e. each element also keeps a pointer to the next frequency.
There are some strange things going on below the linked list. These simply mean that each frequency keeps a pointer to the frequency before it (this is more space efficient than each element keeping pointers to both its own and the next frequency).
Similarly to the above solution, we'd keep a mapping (array or map) of integer to node in this structure.
To do an insert:
Look up the node via the mapping.
Remove the node.
Get the pointer to the next frequency, insert it after that node.
Set the next frequency pointer using the element after the insert position (either it is the next frequency, in which case we can just make the pointer point to that, otherwise we can make this next frequency pointer point to the same element as that element's next frequency pointer).
To do a remove:
Look up the node via the mapping.
Remove the node.
Get the pointer to the current frequency via the next frequency, insert it before that node.
Set the next frequency pointer to that node.
To get the mode:
Return the last node.
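The structure above can be sketched in code. The sketch below (my own naming; it uses one bucket object per frequency, created and destroyed on demand, rather than the skipped-number optimization) keeps a doubly linked list of frequency buckets plus a map from element to its bucket, giving O(1) insert, delete, and mode:

```python
class Bucket:
    """One node per frequency, holding all elements with that frequency."""
    __slots__ = ("count", "keys", "prev", "next")

    def __init__(self, count):
        self.count = count
        self.keys = set()
        self.prev = self.next = None


class FrequencyList:
    def __init__(self):
        self.head = Bucket(0)   # sentinel with frequency 0, never removed
        self.tail = self.head   # bucket holding the highest frequency
        self.where = {}         # element -> bucket currently holding it

    def _insert_after(self, node, new):
        new.prev, new.next = node, node.next
        if node.next:
            node.next.prev = new
        node.next = new
        if node is self.tail:
            self.tail = new

    def _remove(self, node):
        node.prev.next = node.next
        if node.next:
            node.next.prev = node.prev
        if node is self.tail:
            self.tail = node.prev

    def insert(self, x):
        cur = self.where.get(x, self.head)
        # make sure the count+1 bucket exists right after cur
        if cur.next is None or cur.next.count != cur.count + 1:
            self._insert_after(cur, Bucket(cur.count + 1))
        cur.next.keys.add(x)
        self.where[x] = cur.next
        if cur is not self.head:
            cur.keys.discard(x)
            if not cur.keys:
                self._remove(cur)

    def delete(self, x):
        cur = self.where[x]
        cur.keys.discard(x)
        if cur.count > 1:
            # make sure the count-1 bucket exists right before cur
            if cur.prev.count != cur.count - 1:
                self._insert_after(cur.prev, Bucket(cur.count - 1))
            cur.prev.keys.add(x)
            self.where[x] = cur.prev
        else:
            del self.where[x]
        if not cur.keys:
            self._remove(cur)

    def mode(self):
        """Any element in the last (highest-frequency) bucket."""
        return next(iter(self.tail.keys)) if self.tail is not self.head else None
```

Every operation touches only a constant number of buckets and pointers, which is where the O(1) bound comes from.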

Since the range is fixed, for simplicity let's take an example with M=7 (range 1 to 7). So we need at most 3 bits to represent each number.
0 - 000
1 - 001
2 - 010
3 - 011
4 - 100
5 - 101
6 - 110
7 - 111
Now create a binary tree with each node having 2 children (like in the Huffman coding algorithm). Each leaf will contain the frequency of one number (initially 0 for all). The addresses of these leaf nodes are saved in an array, keyed by the number itself (i.e. the address of the node for 1 will be at index 1 in the array).
With pre-processing, we can execute insert, remove in O(1), mode in O(M) time.
insert(x) - go to location x in the array, get the address of the node and increment the counter for that node.
delete(x) - as above, just decrement the counter if it is > 0.
mode - linear search in the array for the maximum frequency (counter value).
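Since the array already gives direct access to each leaf's counter, this scheme can be sketched as a plain array of counters (naming is mine):

```python
class CounterArray:
    """Counts over a fixed range 1..M: O(1) insert/delete, O(M) mode."""

    def __init__(self, M):
        self.count = [0] * (M + 1)   # index 0 unused; index x holds freq of x

    def insert(self, x):
        self.count[x] += 1

    def delete(self, x):
        if self.count[x] > 0:        # ignore deletes of absent numbers
            self.count[x] -= 1

    def mode(self):
        # linear scan over the fixed range: O(M)
        return max(range(1, len(self.count)), key=lambda i: self.count[i])
```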

How do I perform a deletion of the kth element on a min-max heap?

A min-max heap can be useful to implement a double-ended priority queue because of its constant-time find-min and find-max operations. We can also delete the minimum and maximum elements in the min-max heap in O(log2 n) time. Sometimes, though, we may also want to delete any node in the min-max heap, and this can be done in O(log2 n), according to the paper which introduced min-max heaps:
...
The structure can also be generalized to support the operation Find(k) (determine the kth smallest value in the structure) in constant time and the operation Delete(k) (delete the kth smallest value in the structure) in logarithmic time, for any fixed value (or set of values) of k.
...
How exactly do I perform a deletion of the kth element on a min-max heap?
I don't consider myself an "expert" in the fields of algorithms and data structures, but I do have a detailed understanding of binary heaps, including the min-max heap. See, for example, my blog series on binary heaps, starting with http://blog.mischel.com/2013/09/29/a-better-way-to-do-it-the-heap/. I have a min-max implementation that I'll get around to writing about at some point.
Your solution to the problem is correct: you do indeed have to bubble up or sift down to re-adjust the heap when you delete an arbitrary node.
Deleting an arbitrary node in a min-max heap is not fundamentally different from the same operation in a max-heap or min-heap. Consider, for example, deleting an arbitrary node in a min-heap. Start with this min-heap:
0
4 1
5 6 2 3
Now if you remove the node 5 you have:
0
4 1
6 2 3
You take the last node in the heap, 3, and put it in the place where 5 was:
0
4 1
3 6 2
In this case you don't have to sift down because it's already a leaf, but it's out of place because it's smaller than its parent. You have to bubble it up to obtain:
0
3 1
4 6 2
The same rules apply for a min-max heap. You replace the element you're removing with the last item from the heap, and decrease the count. Then, you have to check to see if it needs to be bubbled up or sifted down. The only tricky part is that the logic differs depending on whether the item is on a min level or a max level.
In your example, the heap that results from the first operation (replacing 55 with 31) is invalid because 31 is smaller than 54. So you have to bubble it up the heap.
One other thing: removing an arbitrary node is indeed a log2(n) operation. However, finding the node to delete is an O(n) operation unless you have some other data structure keeping track of where nodes are in the heap. So, in general, removal of an arbitrary node is considered O(n).
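The replace-with-last, then bubble-up-or-sift-down procedure described above can be sketched for a plain array-backed min-heap (the function name is mine):

```python
def delete_at(heap, i):
    """Remove heap[i] from a min-heap in O(log n), keeping the heap valid."""
    last = heap.pop()                 # take the last node out of the heap
    if i == len(heap):                # the target was the last node itself
        return
    heap[i] = last                    # put the last node into the hole
    parent = (i - 1) // 2
    if i > 0 and heap[i] < heap[parent]:
        # smaller than its parent: bubble up
        while i > 0 and heap[i] < heap[(i - 1) // 2]:
            p = (i - 1) // 2
            heap[i], heap[p] = heap[p], heap[i]
            i = p
    else:
        # otherwise sift down until both children are larger
        n = len(heap)
        while True:
            smallest = i
            for c in (2 * i + 1, 2 * i + 2):
                if c < n and heap[c] < heap[smallest]:
                    smallest = c
            if smallest == i:
                break
            heap[i], heap[smallest] = heap[smallest], heap[i]
            i = smallest
```

Running it on the answer's example heap [0, 4, 1, 5, 6, 2, 3] and deleting index 3 (the 5) reproduces the bubble-up case shown above.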
What led me to develop this solution (which I'm not 100% sure is correct) is the fact that I actually found a solution to delete any node in a min-max heap, but it's wrong.
The wrong solution can be found here (implemented in C++) and here (implemented in Python). I'm going to present the aforementioned wrong Python solution, which is more accessible to everyone:
The solution is the following:
def DeleteAt(self, position):
    """delete given position"""
    self.heap[position] = self.heap[-1]
    del self.heap[-1]
    self.TrickleDown(position)
Now, suppose we have the following min-max heap:
level 0 10
level 1 92 56
level 2 41 54 23 11
level 3 69 51 55 65 37 31
as far as I've checked this is a valid min-max heap. Now, suppose we want to delete the element 55, which in a 0-based array would be found at index 9 (if I counted correctly).
What the solution above would do is simply put the last element in the array, in this case 31, and put it at position 9:
level 0 10
level 1 92 56
level 2 41 54 23 11
level 3 69 51 31 65 37 55
it would delete the last element of the array (which is now 55), and the resulting min-max heap would look like this:
level 0 10
level 1 92 56
level 2 41 54 23 11
level 3 69 51 31 65 37
and finally it would "trickle-down" from that position (i.e. where the number 31 now is).
"trickle-down" would check whether we're on an even (or min) or odd (or max) level: we're on an odd level (3), so "trickle-down" would call "trickle-down-max" starting from 31, but since 31 has no children, it stops (check the original paper above if you don't know what I'm talking about).
But observe that this leaves the data structure in a state that is no longer a min-max heap: 54, which is on an even level and therefore should be smaller than its descendants, is greater than 31, one of its descendants.
This made me think that we couldn't just look at the children of the node at position, but that we also needed to check from that position upwards, that maybe we needed to use "trickle-up" too.
In the following reasoning, let x be the element at position after we delete the element we wanted to delete and before any fix-up operation has run. Let p be its parent (if any).
The idea of my algorithm is really that one, and more specifically is based on the fact that:
If x is on an odd level (like in the example above) and we exchange it with its parent p, which is on an even level, that would not break any rules/invariants of the min-max heap from x's new position downwards.
The same reasoning (I think) applies in the reversed situation, i.e., x originally on an even level and greater than its parent.
Now, if you noticed, the only thing that could still need a fix is that, if x was exchanged with its parent and is now on an even (respectively odd) level, we may need to check whether it's smaller (respectively greater) than the node at the previous even (respectively odd) level.
This of course didn't seem to be the whole solution to me; I also wanted to check whether x's previous parent, p, is in a correct position.
If p, after the exchange with x, is on an odd (respectively even) level, it could be smaller (respectively greater) than any of its descendants, because it was previously on an even (respectively odd) level. So I thought we needed a "trickle-down" here.
As for whether p is in a correct position with respect to its ancestors, I think the reasoning would be similar to the one above (but I'm not 100% sure).
Putting this together I came up with the solution:
function DELETE(H, i):
    // H is the min-max heap array
    // i is the index of the node we want to delete
    // I assume, for simplicity,
    // it's not out of the bounds of the array
    if i is the last index of H:
        remove and return H[i]
    else:
        l = get_last_index_of(H)
        swap(H, i, l)
        d = DELETE(H, l)
        // d is the element we wanted to remove initially
        // and was initially at position i
        // So, at index i we now have what was the last element of H
        push_up(H, i)
        push_down(H, i)
        return d
This seems to work according to an implementation of a min-max heap that I made and that you can find here.
Note also that the solution runs in O(log2 n) time, because we're just calling "push-up" and "push-down", which both run in that time.

Generator of natural numbers which returns the smallest number that is not currently taken

I am looking for a known algorithm for the following task:
I need an object X which provides me with two methods:
take() which returns the smallest natural number which is not taken, i.e. sequential calls of this method would return 1, 2, 3 and so on.
free(n) which marks n as not taken if it is currently taken, or throws an exception if it was never taken or was already freed.
Example:
take = 1
take = 2
take = 3
free(2)
take = 2
take = 4
free(3)
free(2)
free(1)
take = 1
take = 2
take = 3
take = 5
free(6) : exception
I've invented a bit-set B-tree (not sure what to call it correctly) where leaves contain the actual bit set of all taken numbers while the other nodes group leaves for search purposes. Every non-leaf node has 32 children, so the memory overhead is 1/31 of a bit for every bit in the leaf bitset.
The actual search for a 'hole' in a node (both leaf and non-leaf) is done as a bit-operations-based binary search which takes log2(32) = 5 operations, while the walk from root to leaves takes log32(L) steps, where L is the largest number in the 'taken' state.
Therefore, both operations cost 5*log32(L), and it takes about 1.032*L bits to store the structure in memory. In the worst case, L is no bigger than the maximal number of numbers taken at the same time. It may not decrease even if all numbers but one are freed, but if there were never more than 10 numbers taken at once, L will be at most ten.
What do you think about my reinvention of the wheel?
The reason why I need this structure is a very specific kind of id generation, but maybe there are other applications? This is a bonus question :)
Thanks for the attention.
You can also use a sorted list for the freed numbers.
Let n denote the maximum number returned by the take() function.
Every time free(d) is called, first search for d in the sorted list and throw an exception if necessary; otherwise, insert d into the sorted list. The whole operation has a cost of O(log n).
Now, every time take() is called, simply return the minimum number in the list and delete it, with cost O(1). If the sorted list was empty, increment n and return it.
Note that there is no memory overhead in this approach :)
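A small sketch of this idea, with a binary min-heap standing in for the sorted list (so take() from the pool becomes O(log n) rather than O(1)) and a set kept alongside it for the exception check; all names are mine:

```python
import heapq

class NumberAllocator:
    """Hand out the smallest untaken natural number; free() returns one to the pool."""

    def __init__(self):
        self.next_new = 1        # smallest number never handed out yet
        self.freed = []          # min-heap of freed (reusable) numbers
        self.freed_set = set()   # mirror of the heap, for O(1) membership tests

    def take(self):
        if self.freed:
            n = heapq.heappop(self.freed)   # smallest freed number
            self.freed_set.remove(n)
            return n
        n = self.next_new
        self.next_new += 1
        return n

    def free(self, n):
        # every number < next_new is either taken or already in freed_set
        if n >= self.next_new or n in self.freed_set:
            raise ValueError(f"{n} is not currently taken")
        heapq.heappush(self.freed, n)
        self.freed_set.add(n)
```

This reproduces the question's example sequence, including the exception on free(6).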
It looks pretty efficient.
Every non-leaf node will have, on average, fewer than 32 children. 32 is the correct maximum for the binary search within a node, but maybe some smaller number d should be used as the base in 5*logd(L): either the average or the minimum density.

Searching and deletion in O(log n) time using arrays

Suppose we are given a set of n numbers initially. We need to construct a data structure that supports Search(), Predecessor() and Deletion() queries.
Search and Predecessor should take worst-case O(log n) time. Deletion should take amortized O(log n) time. Preprocessing time allowed is O(n). We initially have O(n) space, but I want that at any stage, if there are m numbers in the structure, the space used is O(m).
This problem can be solved trivially using an RBT, but I want a data structure that makes use of array(s) only. I don't want an implementation of an RBT using arrays; the data structure shouldn't use any algorithms inspired by trees. Can anyone think of one such data structure?
I can suggest a kind of tree structure which is simpler than an RBT. To simplify the description, let the number of elements be 2^k.
The first level is just the numbers.
The second level has 2^(k-1) numbers, like in a binary tree.
The next level has 2^(k-2) numbers, and so on, until we have only one number.
So we have a binary tree with 2^(k+1) nodes, where each parent contains the maximum of its children's values. Building this tree takes O(n) time. The tree consumes O(n) space: the first level takes n, the second n/2, the third n/4, and so on, so the total is n + n/2 + n/4 + ... = 2n = O(n). E.g. we get the following tree for the numbers 1, 2, 4, 7, 8, 9, 12, 14:
1 2 4 7 8 9 12 14
2 7 9 14
7 14
14
To delete an element, find it using binary search on the first level and mark it NULL there. Then update the tree upwards: each parent becomes the maximum of its non-NULL children, or NULL if both children are NULL. This takes k operations (the tree height), or O(log n).
E.g. we delete 12.
1 2 4 7 8 9 NULL(12) 14
2 7 9 14
7 14
14
Now delete 7.
1 2 4 NULL(7) 8 9 NULL(12) 14
2 4 9 14
4 14
14
Now we can search, or find a predecessor, in O(log n). By binary search, find our element or its predecessor position in the first level. If it is not NULL, we have the answer. If it is NULL, go up a level; we have three cases there:
The node has a NULL value: go up to the next level.
The node's left child has a non-NULL value: it is smaller than the requested number (due to the tree structure) and it is the answer.
The node's value is bigger than the requested number: go up.
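A sketch of this max-tournament tree (my own code; it assumes a power-of-two number of elements, per the simplification above, and, while climbing, returns the left sibling's stored maximum, which by construction is smaller than the query):

```python
import bisect

def _max2(a, b):
    """Maximum of two values, where either may be None (fully deleted subtree)."""
    if a is None:
        return b
    if b is None:
        return a
    return a if a > b else b

class TournamentTree:
    def __init__(self, sorted_values):
        n = len(sorted_values)
        assert n > 0 and n & (n - 1) == 0, "power-of-two size assumed for brevity"
        self.keys = list(sorted_values)        # frozen copy, used for binary search
        self.levels = [list(sorted_values)]    # levels[0] is the leaf level
        while len(self.levels[-1]) > 1:
            prev = self.levels[-1]
            self.levels.append([_max2(prev[2 * i], prev[2 * i + 1])
                                for i in range(len(prev) // 2)])

    def delete(self, x):
        i = bisect.bisect_left(self.keys, x)   # x assumed present
        self.levels[0][i] = None
        for lvl in range(1, len(self.levels)): # recompute the ancestors' maxima
            i //= 2
            self.levels[lvl][i] = _max2(self.levels[lvl - 1][2 * i],
                                        self.levels[lvl - 1][2 * i + 1])

    def predecessor(self, x):
        """Largest surviving value <= x, or None if there is none."""
        i = bisect.bisect_right(self.keys, x) - 1
        if i < 0:
            return None
        if self.levels[0][i] is not None:
            return self.levels[0][i]
        # climb: whenever we are a right child, the left sibling's stored
        # maximum (if not NULL) covers a subtree of values smaller than x
        for lvl in range(len(self.levels) - 1):
            if i % 2 == 1 and self.levels[lvl][i - 1] is not None:
                return self.levels[lvl][i - 1]
            i //= 2
        return None
```

Deletion and predecessor both walk one root-to-leaf path, so both are O(log n) as claimed.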

min/max number of records on a B+Tree?

I was looking at the best & worst case scenarios for a B+Tree (http://en.wikipedia.org/wiki/B-tree#Best_case_and_worst_case_heights) but I don't know how to use this formula with the information I have.
Let's say I have a tree B with 1,000 records; what is the minimum (and maximum) number of levels B can have?
I can have as many/as few keys on each page as I like, and as many/as few pages.
Any ideas?
(In case you are wondering, this is not a homework question, but it will surely help me understand some stuff for hw.)
I don't have the math handy, but...
Basically, the primary factor to tree depth is the "fan out" of each node in the tree.
Normally, in a simple binary tree, the fan out is 2: 2 nodes as children for each node in the tree.
But with a B+Tree, typically they have a fan out much larger.
One factor that comes in to play is the size of the node on disk.
For example, suppose you have a 4K page size with, say, 4000 bytes of free space (not counting pointers or other metadata related to the node), and let's say that a pointer to any other node in the tree is a 4-byte integer. If your B+Tree is in fact storing 4-byte integers, then the combined size is 4 bytes of pointer information + 4 bytes of key information = 8 bytes, and 4000 free bytes / 8 bytes = 500 possible children.
That gives you a fan out of 500 for this contrived case.
So, with one page of index, i.e. the root node, or a height of 1 for the tree, you can reference 500 records. Add another level, and you're at 500*500, so for 501 4K pages, you can reference 250,000 rows.
Obviously, the larger the key size, or the smaller the page size of your node, the lower the fan out the tree is capable of. If you allow variable-length keys in each node, then the fan out can easily vary.
But hopefully you can see the gist of how this all works.
It depends on the arity of the tree. You have to define this value. If you say that each node can have 4 children and you have 1000 records, then the height is
Best case log_4(1000) = 5
Worst case log_{4/2}(1000) = 10
The arity is m and the number of records is n.
The best and worst case depend on the number of children each node can have. For the best case, we consider the case when each node has the maximum number of children (i.e. m for an m-ary tree), with each node having m-1 keys. So,
1st level(or root) has m-1 entries
2nd level has m*(m-1) entries (since the root has m children with m-1 keys each)
3rd level has m^2*(m-1) entries
....
Hth level has m^(H-1)*(m-1) entries
Thus, if H is the height of the tree, the total number of entries is equal to n=m^H-1
which is equivalent to H=log_m(n+1)
Hence, in your case, if you have n=1000 records with each node having m children (m should be odd), then the best case height will be equal to log_m(1000+1)
Similarly, for the worst case scenario:
Level 1(root) has at least 1 entry (and minimum 2 children)
2nd level has at least 2*(d-1) entries (where d=ceil(m/2) is the minimum number of children each internal node (except the root) can have)
3rd level has 2d*(d-1) entries
...
Hth level has 2*d^(H-2)*(d-1) entries
Thus, if H is the height of the tree, the total number of entries is equal to n=2*d^(H-1)-1 (summing the geometric series above), which is equivalent to H=1+log_d((n+1)/2)
Hence, in your case, if you have n=1000 records with each node having m children (m should be odd), then the worst case height will be equal to 1+log_d((1000+1)/2)
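These height formulas can be checked numerically. The sketch below (naming is mine) uses the best-case relation n = m^H - 1 and the worst-case relation n = 2*d^(H-1) - 1 with d = ceil(m/2), rounding up to whole levels:

```python
import math

def btree_heights(n, m):
    """Best/worst-case heights for n records and at most m children per node.

    Best case: every node is full (m children, m-1 keys), so n = m^H - 1.
    Worst case: internal nodes have d = ceil(m/2) children (root has 2),
    so n = 2*d^(H-1) - 1.
    """
    d = math.ceil(m / 2)
    best = math.ceil(math.log(n + 1, m))
    worst = math.ceil(1 + math.log((n + 1) / 2, d))
    return best, worst
```

For n = 1000 and m = 4 this gives heights 5 and 10, agreeing with the log_4(1000) and log_2(1000) estimates in the other answer.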

Sequentially Constructing Full B-Trees

If I have a sorted set of data which I want to store on disk in a way that is optimal for both reading sequentially and doing random lookups, it seems that a B-Tree (or one of its variants) is a good choice ... presuming this data set does not all fit in RAM.
The question is can a full B-Tree be constructed from a sorted set of data without doing any page splits? So that the sorted data can be written to disk sequentially.
Constructing a "B+ tree" to those specifications is simple.
Choose your branching factor k.
Write the sorted data to a file. This is the leaf level.
To construct the next highest level, scan the current level and write out every kth item.
Stop when the current level has k items or fewer.
Example with k = 2:
0 1|2 3|4 5|6 7|8 9
0 2 |4 6 |8
0 4 |8
0 8
Now let's look for 5. Use binary search to find the last number less than or equal to 5 in the top level, or 0. Look at the interval in the next lowest level corresponding to 0:
0 4
Now 4:
4 6
Now 4 again:
4 5
Found it. In general, the jth item corresponds to items jk through (j+1)k-1 at the next level. You can also scan the leaf level linearly.
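The construction and search just described can be sketched as follows (function names are mine):

```python
import bisect

def build_levels(sorted_data, k):
    """Bottom-up: each index level keeps every k-th item of the level below."""
    levels = [list(sorted_data)]        # levels[0] is the leaf level
    while len(levels[-1]) > k:
        levels.append(levels[-1][::k])
    return levels

def lookup(levels, k, x):
    """Descend from the top level; item j covers items j*k .. (j+1)*k - 1 below."""
    j = bisect.bisect_right(levels[-1], x) - 1   # last top-level item <= x
    if j < 0:
        return False                             # x is below the smallest key
    for lvl in range(len(levels) - 1, 0, -1):
        below = levels[lvl - 1]
        lo, hi = j * k, min(j * k + k, len(below))
        # last item <= x within the covered interval of the next level down
        j = bisect.bisect_right(below, x, lo, hi) - 1
    return levels[0][j] == x
```

With the data 0..9 and k = 2 this builds exactly the four levels shown in the example, and looking up 5 follows the same 0, 4, 4 path described above.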
We can make a B-tree in one pass, but it may not be the optimal storage method. Depending on how often you make sequential queries vs random access ones, it may be better to store it in sequence and use binary search to service a random access query.
That said: assume that each node in your B-tree holds (m - 1) keys (m > 2; the binary case is a bit different). We want all the leaves on the same level and all the internal nodes to have at least (m - 1) / 2 keys. We know that a full B-tree of height k has (m^k - 1) keys. Assume that we have n keys in total to store. Let k be the smallest integer such that m^k - 1 > n. Now, if 2*m^(k - 1) - 1 < n, we can completely fill up the inner nodes and distribute the rest of the keys evenly to the leaf nodes, each leaf node getting either the floor or the ceiling of (n + 1 - m^(k - 1))/m^(k - 1) keys. If we cannot do that, then we know we have enough to fill all of the nodes at depth k - 1 at least halfway and store one key in each of the leaves.
Once we have decided the shape of our tree, we need only do an inorder traversal of the tree sequentially dropping keys into position as we go.
Optimal meaning that an inorder traversal of the data will always be seeking forward through the file (or mmaped region), and a random lookup is done in a minimal number of seeks.
