Suppose we are given a set of n numbers initially. We need to construct a data structure that supports Search(), Predecessor() and Deletion() queries.
Search and Predecessor should take worst case O(log n) time. Deletion should take amortized O(log n) time. Pre-processing time allowed is O(n). We initially have O(n) space. But I want that at any stage, if there are m numbers in the structure, the space used should be O(m).
This problem can be solved trivially using RBT, but I want a data structure that makes use of array(s) only. I don't want an implementation of RBT using arrays, the data structure shouldn't use any algorithms inspired from trees. Can anyone think of one such data structure?
I can suggest a kind of tree structure that is easier than an RBT. To simplify its description, let the number of elements be 2^k.
The first level is just the numbers.
The second level has 2^(k-1) numbers, as in a binary tree.
The next level has 2^(k-2) numbers, and so on until we have only one number.
So we have a binary tree with 2^(k+1) - 1 nodes, in which each parent holds the maximum of its children's values. Building the tree takes O(n) time. The tree uses O(n) space: the first level uses n, the second n/2, the third n/4, and so on, so the total is n + n/2 + n/4 + ... < 2n = O(n). E.g. we will have the following tree for the numbers 1, 2, 4, 7, 8, 9, 12, 14:
1 2 4 7 8 9 12 14
2 7 9 14
7 14
14
To delete an element, find it using binary search and mark it NULL in the first level. Then update the tree upward: each ancestor takes the maximum of its non-NULL children, or NULL if both children are NULL. This takes k operations (the tree height), i.e. O(log n).
E.g. we delete 12.
1 2 4 7 8 9 NULL(12) 14
2 7 9 14
7 14
14
Now delete 7.
1 2 4 NULL(7) 8 9 NULL(12) 14
2 4 9 14
4 14
14
Now for search or predecessor in O(log n). Use binary search on the first level to find the element or its predecessor. If that entry is not NULL, it is the answer. If it is NULL, go one level up; there are three cases:
Node has a NULL value: go up another level.
Node's left child has a non-NULL value: it is smaller than the requested number (by the tree structure) and it is the answer.
Node's value is bigger than the requested number: go up another level.
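The level structure above can be sketched in Python as follows. This is only an illustration under some assumptions: the class name `MaxLevels` and its method names are hypothetical, the input length is assumed to be a power of two, and the original sorted keys are kept in a separate array so that binary search still works after entries are marked NULL (`None`). Because of that extra array, the sketch uses O(n) space rather than the O(m) the question asks for.

```python
from bisect import bisect_left, bisect_right


class MaxLevels:
    """Levels of maxima over a sorted array; deleted entries are None."""

    def __init__(self, sorted_keys):
        # Assumption: len(sorted_keys) is a power of two.
        self.keys = list(sorted_keys)      # immutable copy used for binary search
        self.levels = [list(self.keys)]    # levels[0] is the first level
        while len(self.levels[-1]) > 1:
            prev = self.levels[-1]
            self.levels.append(
                [self._merge(prev[i], prev[i + 1]) for i in range(0, len(prev), 2)]
            )

    @staticmethod
    def _merge(left, right):
        # Keys are sorted, so the right child is the maximum whenever it exists.
        return right if right is not None else left

    def delete(self, x):
        i = bisect_left(self.keys, x)
        if i == len(self.keys) or self.keys[i] != x or self.levels[0][i] is None:
            return False
        self.levels[0][i] = None
        # Recompute the ancestors, one per level.
        for h in range(1, len(self.levels)):
            i //= 2
            below = self.levels[h - 1]
            self.levels[h][i] = self._merge(below[2 * i], below[2 * i + 1])
        return True

    def predecessor(self, x):
        """Largest remaining key <= x, or None if there is none."""
        i = bisect_right(self.keys, x) - 1
        if i < 0:
            return None
        if self.levels[0][i] is not None:
            return self.levels[0][i]
        # Walk up; a non-None left sibling holds the maximum of everything
        # to its left within the parent's range, which is the answer.
        for level in self.levels[:-1]:
            if i % 2 == 1 and level[i - 1] is not None:
                return level[i - 1]
            i //= 2
        return None
```

With the keys 1, 2, 4, 7, 8, 9, 12, 14, deleting 12 and then 7 reproduces the upper levels shown in the example, and `predecessor(12)` walks up one level before finding 9 in a left sibling.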
There is an algorithm interview question:
We have n sorted arrays; how do we find the m-th most frequent element in the aggregate of the n arrays? Moreover, how can we save space, even at the cost of some time complexity?
What I can think of is to enumerate all the elements of the n arrays and use a hashmap to record their frequencies, then sort the hashmap entries by value (frequency). But then there is no difference from the one-array case.
Walk over all arrays in parallel using n pointers
1 4 7 12 34
2 6 9 12 25
The walk would look like this
1 1 4 7 7 12 12 34
* 2 2 2 9 12 25 34
You do need a hash map in order to count the number of occurrences of elements in the cut. E.g. at the second step in the example, your cut contains 1 and 2.
Also, you need two min-heaps: one over the current cut, to choose which array to advance along, and another to store the m most frequent elements.
The complexity would be expected O(#elements * (log(n) + log(m))). The space requirement is O(n + m).
But if you really need to save space, you can treat all these n sorted arrays as one big unsorted array, sort it in place with something like heapsort and choose the longest subarray of duplicates. This would require O(#elements * log(#elements)) time but only O(1) extra space.
You do an n-way merge, but instead of writing out the merged array, you just count the length of each run of duplicate values and remember the longest m in a min-heap.
This takes O(total_length * (log n + log m)) time, and O(n) space.
It's a combination of common SO questions. Search above on "merge k sorted lists" and "kth largest"
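The n-way merge with run counting described above can be sketched in Python. The function name `mth_most_frequent` is hypothetical, and ties between equally frequent values are broken arbitrarily (the first one seen is kept); `heapq.merge` and `itertools.groupby` are standard-library helpers that do the lazy merge and the run grouping.

```python
import heapq
from itertools import groupby


def mth_most_frequent(arrays, m):
    """Return (value, count) of the m-th most frequent value, or None.

    `arrays` is a list of individually sorted lists. The merged stream is
    never materialised: heapq.merge keeps one cursor per input array.
    """
    top = []  # min-heap of (count, value), holding at most m entries
    for value, run in groupby(heapq.merge(*arrays)):
        count = sum(1 for _ in run)  # length of this run of duplicates
        if len(top) < m:
            heapq.heappush(top, (count, value))
        elif count > top[0][0]:
            heapq.heapreplace(top, (count, value))
    if len(top) < m:
        return None  # fewer than m distinct values
    count, value = top[0]  # smallest count among the m largest
    return value, count
```

For the two example arrays, `mth_most_frequent([[1, 4, 7, 12, 34], [2, 6, 9, 12, 25]], 1)` picks out 12, the only value that occurs twice.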
Is there a possible way to find the original insertion order of a B+ tree?
I have this tree:
{ [ (1 2) 3 (5 6 7) ] 8 [ (9 10) 11 (12 13) 14 (14 16 17) 18 ( 19 20) ] }
No.
For example in your case, the last two inserts could have been 7, 17 or 17, 7 and there is absolutely no way to tell which was which. Indeed of 5, 6, 7 one of the three was inserted after the other two and there is no record of which is which.
This can also be seen immediately from the pigeonhole principle. First let's put an upper bound on how many k-way B-trees there can be with n things in them.
The structure of any B-tree with n things can be encoded as a stream of the sizes of the nodes. From the size of the top node, you know how many second tier nodes there are, and likewise for the third tier from the second tier. A node can have anywhere from 1..k things in it. There cannot be more nodes than elements. Therefore we can specify a B-tree by first specifying how many nodes there are, then the sizes of the nodes. (Not all sets of numbers will be a B-tree.) For every size s of B-tree, there are k^s <= k^n of them. Therefore n k^n is an upper bound on how many k-way B-trees there can be, which is exponential growth.
But the number of orders in which elements might be inserted is n!. This function grows strictly faster than exponential growth, and so you cannot recover the order from the B-tree.
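To make the counting argument concrete, here is a small Python check that compares the n k^n bound on tree shapes with the n! insertion orders. The function name is hypothetical; it just finds the first n at which the number of orders exceeds the bound, from which point distinct orders must collide on the same tree.

```python
from math import factorial


def first_n_where_orders_exceed_trees(k):
    """Smallest n with n! > n * k**n, for branching parameter k."""
    n = 1
    while factorial(n) <= n * k ** n:
        n += 1
    return n
```

For k = 4 this returns 12: there are 12! = 479001600 possible insertion orders, but at most 12 * 4^12 = 201326592 trees under the bound above.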
Here is the interview problem: design a data structure for integers in the range {1,...,M} (numbers can be repeated) that supports insert(x), delete(x) and return mode, which returns the most frequent number.
The interviewer said that all the operations can be done in O(1) with O(M) preprocessing. He also accepted insert(x) and delete(x) in O(log(n)) and return mode in O(1), with O(M) preprocessing.
But I could only come up with O(n) for insert(x) and delete(x) and O(1) for return mode. How can I achieve O(log(n)) and/or O(1) for insert(x) and delete(x), with return mode in O(1) and O(M) preprocessing?
When you hear O(log X) operations, the first structures that come to mind should be a binary search tree and a heap. For reference (since I'm focusing on a heap below):
A heap is a specialized tree-based data structure that satisfies the heap property: If A is a parent node of B then the key of node A is ordered with respect to the key of node B with the same ordering applying across the heap. ... The keys of parent nodes are always greater than or equal to those of the children and the highest key is in the root node (this kind of heap is called max heap) ....
A binary search tree doesn't allow construction (from unsorted data) in O(M), so let's see if we can make a heap work (you can create a heap in O(M)).
Clearly we want the most frequent number at the top, so this heap needs to use frequency as its ordering.
But this brings us to a problem - insert(x) and delete(x) will both require that we look through the entire heap to find the correct element.
Now you should be thinking "what if we had some sort of mapping from index to position in the tree?", and this is exactly what we're going to have. If all / most of the M elements exist, we could simply have an array, with each index i's element being a pointer to the node in the heap. If implemented correctly, this will allow us to look up the heap node in O(1), which we could then modify appropriately, and move, taking O(log M) for both insert and delete.
If only a few of the M elements exist, replacing the array with a (hash-)map (of integer to heap node) might be a good idea.
Returning the mode will take O(1).
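A minimal Python sketch of this heap-plus-index idea follows. The class name `ModeHeap` and its internals are hypothetical; it assumes all M values are present in the heap from the start (with frequency 0), and it uses array indices in place of node pointers.

```python
class ModeHeap:
    """Max-heap of the values 1..M ordered by frequency, with a position index."""

    def __init__(self, M):
        self.freq = [0] * (M + 1)          # freq[x] = current count of x
        self.heap = list(range(1, M + 1))  # all counts are 0, so any order is a valid heap
        self.pos = [0] * (M + 1)           # pos[x] = index of x inside self.heap
        for i, x in enumerate(self.heap):
            self.pos[x] = i

    def _swap(self, i, j):
        h = self.heap
        h[i], h[j] = h[j], h[i]
        self.pos[h[i]] = i
        self.pos[h[j]] = j

    def _sift_up(self, i):
        while i > 0:
            parent = (i - 1) // 2
            if self.freq[self.heap[i]] <= self.freq[self.heap[parent]]:
                break
            self._swap(i, parent)
            i = parent

    def _sift_down(self, i):
        n = len(self.heap)
        while True:
            largest = i
            for child in (2 * i + 1, 2 * i + 2):
                if child < n and self.freq[self.heap[child]] > self.freq[self.heap[largest]]:
                    largest = child
            if largest == i:
                break
            self._swap(i, largest)
            i = largest

    def insert(self, x):
        self.freq[x] += 1
        self._sift_up(self.pos[x])        # O(1) lookup, O(log M) move

    def delete(self, x):
        if self.freq[x] > 0:
            self.freq[x] -= 1
            self._sift_down(self.pos[x])

    def mode(self):
        # Note: returns some value with count 0 if nothing has been inserted.
        return self.heap[0]
```

The `pos` array plays the role of the mapping from value to heap node, which is what avoids searching the whole heap on insert and delete.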
O(1) for all operations is certainly quite a bit more difficult.
The following structure comes to mind:
 3    2
 ^    ^
 |    |
 5    7    4    1
12   14   15   18
To explain what's going on here - 12, 14, 15 and 18 correspond to the frequency, and the numbers above correspond to the elements with said frequency, so both 5 and 3 would have a frequency of 12, 7 and 2 would have a frequency of 14, etc.
This could be implemented as a double linked-list:
           /-------\                     /-------\
(12) <-> 5 <-> 3 <-> (13) <-> (14) <-> 7 <-> 2 <-> (15) <-> 4 <-> (16) <-> (18) <-> 1
  ^------------------/ ^------/ ^------------------/ ^------------/ ^------/
You may notice that:
I filled in the missing 13 and 16 - these are necessary, otherwise we'll have to update all elements with the same frequency when doing an insert (in this example, you would've needed to update 5 to point to 13 when doing insert(3), because 13 wouldn't have existed yet, so it would've been pointing to 14).
I skipped 17 - this is just an optimization in terms of space usage - this makes this structure take O(M) space, as opposed to O(M + MaxFrequency). The exact condition for skipping a number is simply that there are no elements at its frequency or at one less than its frequency.
There are some strange things going on above the linked-list. These simply mean that 5 points to 13 as well, and 7 points to 15 as well, i.e. each element also keeps a pointer to the next frequency.
There are some strange things going on below the linked-list. These simply mean that each frequency keeps a pointer to the frequency before it (this is more space efficient than each element keeping a pointer to both its own and the next frequency).
Similarly to the above solution, we'd keep a mapping (array or map) of integer to node in this structure.
To do an insert:
Look up the node via the mapping.
Remove the node.
Get the pointer to the next frequency, insert it after that node.
Set the next frequency pointer using the element after the insert position (either it is the next frequency, in which case we can just make the pointer point to that, otherwise we can make this next frequency pointer point to the same element as that element's next frequency pointer).
To do a remove:
Look up the node via the mapping.
Remove the node.
Get the pointer to the current frequency via the next frequency, insert it before that node.
Set the next frequency pointer to that node.
To get the mode:
Return the last node.
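As a rough illustration of the same bucket-by-frequency idea, here is a Python sketch. Note that it is not the linked-list layout described above: it swaps in a hash map from frequency to a set of values, plus a running maximum, which gives expected O(1) per operation. The class and attribute names are hypothetical.

```python
from collections import defaultdict


class ModeBuckets:
    """Values 1..M grouped into buckets by frequency, tracking the top bucket."""

    def __init__(self, M):
        self.count = [0] * (M + 1)       # count[x] = current frequency of x
        self.bucket = defaultdict(set)   # frequency -> values with that frequency
        self.max_freq = 0

    def _leave(self, x, c):
        """Remove x from bucket c; report whether the bucket became empty."""
        members = self.bucket[c]
        members.discard(x)
        if not members:
            del self.bucket[c]           # keep only non-empty buckets
            return True
        return False

    def insert(self, x):
        c = self.count[x]
        if c > 0:
            self._leave(x, c)
        self.count[x] = c + 1
        self.bucket[c + 1].add(x)
        if c + 1 > self.max_freq:
            self.max_freq = c + 1

    def delete(self, x):
        c = self.count[x]
        if c == 0:
            return                       # nothing to delete
        emptied = self._leave(x, c)
        self.count[x] = c - 1
        if c > 1:
            self.bucket[c - 1].add(x)
        # x only moved down by one, so the new maximum is c - 1 at worst.
        if emptied and c == self.max_freq:
            self.max_freq = c - 1

    def mode(self):
        if self.max_freq == 0:
            return None
        return next(iter(self.bucket[self.max_freq]))
```

The key observation is the same as in the linked-list version: an insert or delete moves a value by exactly one frequency, so only the neighbouring bucket is ever touched.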
Since the range is fixed, for simplicity let's take an example with M=7 (range is 1 to 7). So we need at most 3 bits to represent each number.
0 - 000
1 - 001
2 - 010
3 - 011
4 - 100
5 - 101
6 - 110
7 - 111
Now create a binary tree, with each internal node having 2 children (as in the Huffman coding algorithm). Each leaf will contain the frequency of one number (initially 0 for all). The addresses of these leaves are saved in an array, with the key as the index (i.e. the address of the node for 1 is at index 1 in the array).
With this pre-processing, we can execute insert and delete in O(1), and mode in O(M) time.
insert(x) - go to location x in the array, get the address of the node and increment the counter for that node.
delete(x) - as above, but decrement the counter only if it is > 0.
mode - linear search in the array for the maximum frequency (value of counter).
This is an interview question I saw online, and I am not sure I have the correct idea for it.
The problem is:
Design an algorithm to find the two largest elements in a sequence of n numbers.
The number of comparisons needs to be n + O(log n).
I think I might use quicksort and stop when the two largest elements are found?
But I'm not 100% sure about it. If anyone has an idea, please share.
Recursively split the array, find the largest element in each half, then find the largest element that the largest element was ever compared against. That first part requires n - 1 compares, the last part requires O(log n). Here is an example:
1 2 5 4 9 7 8 7 5 4 1 0 1 4 2 3
2 5 9 8 5 1 4 3
5 9 5 4
9 5
9
At each step I'm merging adjacent numbers and taking the larger of the two. It takes n - 1 compares to get down to the largest number, 9. Then, if we look at every number that 9 was compared against (5, 5, 8, 7), we see that the largest one was 8, which must be the second largest in the array. Since there are O(log n) levels in this, it will take O(log n) compares to do this.
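The tournament above can be sketched in Python like this. The function name `two_largest` is hypothetical, and the sketch assumes at least two input values; each surviving "player" carries the list of values it has beaten so far, which has at most about log2(n) entries for the overall winner.

```python
def two_largest(values):
    """Return (largest, second_largest) using a knockout tournament.

    Assumes len(values) >= 2.
    """
    # Each player is (value, list of values it has beaten so far).
    players = [(v, []) for v in values]
    while len(players) > 1:
        next_round = []
        for i in range(0, len(players) - 1, 2):
            a, b = players[i], players[i + 1]
            if a[0] >= b[0]:          # one comparison per match
                a[1].append(b[0])
                next_round.append(a)
            else:
                b[1].append(a[0])
                next_round.append(b)
        if len(players) % 2:          # odd player out gets a bye
            next_round.append(players[-1])
        players = next_round
    winner, beaten = players[0]
    # The runner-up must have lost directly to the winner.
    return winner, max(beaten)
```

On the 16-number example this returns 9 and 8: the winner 9 beat 7, 8, 5 and 5 on its way up, and the largest of those is 8.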
For only the 2 largest elements, a plain selection pass may be good enough; it takes about 2n comparisons, i.e. O(n).
For the more general question of selecting the k largest elements from an array of size n, quicksort is a good line of thinking, but you don't have to actually sort the whole array.
Try this:
Pick a pivot and split the array into N[m] and N[n-m].
If k < m, forget the N[n-m] part and repeat step 1 on N[m].
If k > m, set the N[m] part aside (all of it belongs to the answer) and repeat step 1 on N[n-m]; this time you look for the first k-m elements in N[n-m].
If k = m, you've got it.
This is essentially locating position k in the array N. On average you need about log(N) iterations, and iteration i processes about N/2^i elements, so the expected total work is N + N/2 + N/4 + ... ≈ 2N, i.e. O(N). Note that this is linear on average but does not by itself meet the n + O(log n) comparison bound. It has very good practical performance (faster than a plain quicksort, since it avoids fully sorting; the output is not ordered).
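A short Python sketch of this partition-based selection follows. The name `k_largest` is hypothetical, and it uses a random pivot with a three-way split (larger, equal, smaller) so that duplicates cannot cause endless recursion; the result is not sorted.

```python
import random


def k_largest(items, k):
    """Return the k largest elements of `items`, in no particular order."""
    if k <= 0:
        return []
    if k >= len(items):
        return list(items)
    pivot = random.choice(items)
    larger = [x for x in items if x > pivot]
    equal = [x for x in items if x == pivot]
    smaller = [x for x in items if x < pivot]
    if k <= len(larger):
        # Everything we need is on the "larger" side.
        return k_largest(larger, k)
    if k <= len(larger) + len(equal):
        # Take all larger elements and top up with copies of the pivot.
        return larger + equal[:k - len(larger)]
    # Both upper parts are in the answer; pick the rest from "smaller".
    return larger + equal + k_largest(smaller, k - len(larger) - len(equal))
```

Only one side is ever recursed into, which is where the saving over a full quicksort comes from.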
If I have a sorted set of data that I want to store on disk in a way that is optimal for both sequential reads and random lookups, it seems that a B-Tree (or one of its variants) is a good choice, presuming this data set does not all fit in RAM.
The question is: can a full B-Tree be constructed from a sorted set of data without doing any page splits, so that the sorted data can be written to disk sequentially?
Constructing a "B+ tree" to those specifications is simple.
Choose your branching factor k.
Write the sorted data to a file. This is the leaf level.
To construct the next highest level, scan the current level and write out every kth item.
Stop when the current level has k items or fewer.
Example with k = 2:
0 1|2 3|4 5|6 7|8 9
0 2 |4 6 |8
0 4 |8
0 8
Now let's look for 5. Use binary search to find the last number less than or equal to 5 in the top level, which is 0. Look at the interval in the next lowest level corresponding to 0:
0 4
Now 4:
4 6
Now 4 again:
4 5
Found it. In general, the jth item corresponds to items jk through (j+1)k-1 at the next level. You can also scan the leaf level linearly.
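The construction and lookup above can be sketched in Python, using in-memory lists in place of files. The function names `build_levels` and `contains` are hypothetical; on disk, each level would be written sequentially and each block of k items would correspond to one page.

```python
from bisect import bisect_right


def build_levels(sorted_keys, k):
    """Return levels[0] (leaf level) up to levels[-1] (top level)."""
    levels = [list(sorted_keys)]
    while len(levels[-1]) > k:
        levels.append(levels[-1][::k])   # every kth item of the level below
    return levels


def contains(levels, k, key):
    """Descend from the top level, binary searching one block per level."""
    top = levels[-1]
    j = bisect_right(top, key) - 1       # last item <= key in the top level
    if j < 0:
        return False                     # key is smaller than everything
    for level in reversed(levels[:-1]):
        block = level[j * k:(j + 1) * k] # items jk .. (j+1)k-1 of this level
        j = j * k + bisect_right(block, key) - 1
    return levels[0][j] == key
```

With the keys 0 through 9 and k = 2, `build_levels` produces the four levels shown in the example, and `contains(levels, 2, 5)` visits the blocks (0 4), (4 6) and (4 5) in that order.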
We can make a B-tree in one pass, but it may not be the optimal storage method. Depending on how often you make sequential queries vs random access ones, it may be better to store it in sequence and use binary search to service a random access query.
That said: assume that each record in your b-tree holds (m - 1) keys (m > 2, the binary case is a bit different). We want all the leaves on the same level and all the internal nodes to have at least (m - 1) / 2 keys. We know that a full b-tree of height k has (m^k - 1) keys. Assume that we have n keys total to store. Let k be the smallest integer such that m^k - 1 > n. Now if 2 m^(k - 1) - 1 < n we can completely fill up the inner nodes, and distribute the rest of the keys evenly to the leaf nodes, each leaf node getting either the floor or ceiling of (n + 1 - m^(k - 1))/m^(k - 1) keys. If we cannot do that then we know that we have enough to fill all of the nodes at depth k - 1 at least halfway and store one key in each of the leaves.
Once we have decided the shape of our tree, we need only do an inorder traversal of the tree sequentially dropping keys into position as we go.
Optimal here means that an inorder traversal of the data always seeks forward through the file (or mmapped region), and that a random lookup is done in a minimal number of seeks.