Implementing a smart-list algorithm

I've been asked to devise a data structure called a clever-list, which holds items with real-number keys and offers the following operations:
Insert(x) - inserts a new element into the list. Should run in O(log n).
Remove min/max - removes and returns the min/max element in the list. Should run in O(log n) time.
Transform - switches which element Remove min/max returns (if it was the min, then the max, and vice versa). Should run in O(1).
Random sample(k) - returns k randomly selected elements from the list (0 < k < n). Should run in O(min(k log k, n + (n-k) log(n-k))).
Assumptions about the structure:
The data structure won't hold more than 3n elements at any stage.
We cannot assume that n = O(1).
We can use a Random() method that returns a real number in [0,1) and runs in O(1) time.
I managed to implement the first three methods using a min-max fine heap. However, I have no clue how to implement Random sample(k) within that time bound. All I could find is reservoir sampling, which runs in O(n) time.
Any suggestions?

You can do all of that with a min-max heap implemented in an array, including the random sampling.
For the random sampling, pick a random index from 0 to n-1. That's the index of the item you want to remove. Copy that item, then replace the item at that index with the last item in the array and reduce the count. Now, either bubble that item up or sift it down as required.
If it's on a min level and the item is smaller than its parent, then bubble it up. If it's larger than its smallest child, sift it down. If it's on a max level, you reverse the logic.
That random sampling is O(k log n). That is, you'll remove k items from a heap of n items. It's the same complexity as k calls to delete-min.
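For illustration, here is a minimal C++ sketch of that removal step. To keep it short it uses a plain binary min-heap rather than a min-max heap (the function name and the use of std::mt19937 are mine, not part of the question); in the min-max heap, the level check described below decides between bubbling up and sifting down.

// Sketch only: remove a randomly chosen element from an array-backed binary
// min-heap. Assumes the heap is non-empty.
#include <random>
#include <utility>
#include <vector>

double removeRandom(std::vector<double>& heap, std::mt19937& rng) {
    std::uniform_int_distribution<size_t> pick(0, heap.size() - 1);
    size_t i = pick(rng);                  // random index 0..n-1
    double removed = heap[i];
    heap[i] = heap.back();                 // overwrite with the last item
    heap.pop_back();                       // reduce the count
    if (i < heap.size()) {
        while (i > 0 && heap[i] < heap[(i - 1) / 2]) {     // bubble up
            std::swap(heap[i], heap[(i - 1) / 2]);
            i = (i - 1) / 2;
        }
        for (;;) {                                         // or sift down
            size_t s = i, l = 2 * i + 1, r = 2 * i + 2;
            if (l < heap.size() && heap[l] < heap[s]) s = l;
            if (r < heap.size() && heap[r] < heap[s]) s = r;
            if (s == i) break;
            std::swap(heap[i], heap[s]);
            i = s;
        }
    }
    return removed;
}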
Additional info
If you don't have to remove the items from the list, then you can do a naive random sampling in O(k) by selecting k indexes from the array. However, there is a chance of duplicates. To avoid duplicates, you can do this:
When you select an item at random, swap it with the last item in the array and reduce the count by 1. When you've selected all the items, they're in the last k positions of the array. This is clearly an O(k) operation. You can copy those items to be returned by the function. Then, set count back to the original value and call your MakeHeap function, which can build a heap from an arbitrary array in O(n). So your operation is O(k + n).
The MakeHeap function is pretty simple:
for (int i = count/2; i >= 0; --i)
{
    SiftDown(i);
}
Another option would be, when you do a swap, to save the swap operation on a stack. That is, save the from and to indexes. To put the items back, just run the swaps in reverse order (i.e. pop from the stack, swap the items, and continue until the stack is empty). That's O(k) for the selection, O(k) for putting it back, and O(k) extra space for the stack.
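Here is a rough C++ sketch of that swap-and-undo selection, assuming the heap lives in a std::vector<double> and that k <= n (the function name and the use of std::mt19937 are illustrative, not from the original answer):

#include <random>
#include <utility>
#include <vector>

std::vector<double> sampleAndRestore(std::vector<double>& a, size_t k,
                                     std::mt19937& rng) {
    std::vector<std::pair<size_t, size_t>> swaps;   // record of (from, to) indexes
    size_t count = a.size();
    for (size_t j = 0; j < k; ++j) {
        std::uniform_int_distribution<size_t> pick(0, count - 1);
        size_t i = pick(rng);
        --count;
        swaps.push_back({i, count});
        std::swap(a[i], a[count]);                  // move the selection to the end
    }
    std::vector<double> sample(a.end() - k, a.end());   // selected items: last k slots
    while (!swaps.empty()) {                        // undo the swaps in reverse order
        std::swap(a[swaps.back().first], a[swaps.back().second]);
        swaps.pop_back();
    }
    return sample;
}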
Another way to do it, of course, is to do the removals as I suggested, and once all the removals are done you re-insert the items into the heap. That's O(k log n) to remove and O(k log n) to add.
You could, by the way, do the random sampling in O(k) best case by using a hash table to hold the randomly selected indexes. You just generate random indexes and add them to the hash table (which won't accept duplicates) until the hash table contains k items. The problem with that approach is that, at least in theory, the algorithm could fail to terminate.

If you store the numbers in an array, and use a self-balancing binary tree to maintain a sorted index of them, then you can do all the operations with the time complexities given. In the nodes of the tree, you'll need pointers into the number array, and in the array you'll need a pointer back into the node of the tree where that number belongs.
Insert(x) adds x to the end of the array, and then inserts it into the binary tree.
Remove min/max follows the left/right branches of the binary tree to find the min or max, then removes it. You need to swap the last number in the array into the hole produced by the removal. This is when you need the back pointers from the array back into the tree.
Transform toggles a bit that tells the Remove min/max operation which end to return.
Random sample either picks k or (n-k) unique ints in the range 0..n-1 (depending on whether 2k < n). The random sample is either the elements at those k locations in the number array, or the elements at all but those (n-k) locations.
Creating a set of k unique ints in the range 0..n-1 can be done in O(k) time, assuming that (uninitialized) memory can be allocated in O(1) time.
First, assume that you have a way of knowing whether memory is uninitialized or not. Then you could have an uninitialized array of size n and do the usual k steps of a Fisher-Yates shuffle, except that every time you access an element of the array (say, at index i), if it's uninitialized, you initialize it to the value i. This avoids initializing the entire array, which allows the shuffle to be done in O(k) time rather than O(n) time.
Second, obviously it's not possible in general to know if memory is uninitialized or not, but there's a trick you can use (at the cost of doubling the amount of memory used) that lets you implement a sparse array in uninitialized memory. It's described in depth on Russ Cox's blog here: http://research.swtch.com/sparse
This gives you an O(k) way of randomly selecting k numbers. If k is large (i.e., > n/2), you can select (n-k) numbers instead of k numbers, but you still need to return the non-selected numbers to the user, which is always going to be O(k) if you copy them out, so the faster selection gains you nothing (a sketch of this selection follows below).
A simpler approach, if you don't mind giving out access to your internal data structure, is to do k or n-k steps of the Fisher-Yates shuffle on the underlying array (depending on whether k < n/2, and being careful to update the corresponding nodes in the tree to maintain their values), and then return either a[0..k-1] or a[k..n-1]. In this case, the returned value will only be valid until the next operation on the data structure. This method is O(min(k, n-k)).
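Going back to the O(k) selection of k distinct indexes, here is a hedged C++ sketch. It substitutes a hash map for the uninitialized-memory trick from the blog post: only the array cells that the partial Fisher-Yates shuffle actually touches are stored (the function name is mine), so the expected cost is O(k).

#include <random>
#include <unordered_map>
#include <vector>

std::vector<size_t> randomIndexes(size_t n, size_t k, std::mt19937& rng) {
    std::unordered_map<size_t, size_t> touched;     // sparse view of the array 0,1,...,n-1
    auto valueAt = [&](size_t i) {
        auto it = touched.find(i);
        return it == touched.end() ? i : it->second;
    };
    std::vector<size_t> out;                        // assumes k <= n
    for (size_t j = 0; j < k; ++j) {
        std::uniform_int_distribution<size_t> pick(j, n - 1);
        size_t i = pick(rng);
        size_t vi = valueAt(i);                     // the usual Fisher-Yates swap,
        size_t vj = valueAt(j);                     // done lazily through the map
        out.push_back(vi);
        touched[i] = vj;
        touched[j] = vi;
    }
    return out;
}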

Related

Which algorithm should I use to get the biggest n elements from a large list?

In my project, there is a very large list.
The most common operation on this list is to get the biggest n elements.
The value of n is fixed, or rarely changes, throughout the whole lifetime. Which algorithm should I use in order to do this efficiently?
In other words: what should I do when inserting, updating, or deleting an element in the list, and what should I do when getting the top n elements from the list?
There is a solution (maybe not that good):
After inserting, updating, or deleting an element, sort the list with quicksort or another sorting algorithm. Because the list is very large, this step may be too slow.
When getting the top n elements, take the first n elements from the list.
Is there any better solution?
So you have a list of n items, and you want to pick the k largest. One way to do this is with a min-heap of size k. The resulting algorithm is O(n log k).
Start by creating an empty min-heap of the first k items. Then, for each following item in the list, if it's larger than the smallest item on the heap, remove the smallest item on the heap and replace it with the new item. When you're done, the largest k items will be on the heap. Pseudo code looks like this:
// assume an array a[], with length n.
// k is the number of largest items you want.
heap = new min-heap
// add the first k items to the heap
for (i = 0; i < k; ++i)
    heap.add(a[i])
// for each remaining item, replace the heap's minimum if the item is larger
for (i = k; i < n; ++i)
{
    if (a[i] > heap.peek())
    {
        heap.removeMin()
        heap.add(a[i])
    }
}
// at this point, the largest k items are on the min-heap
This technique works well when k is a small percentage of n. In that case, it requires little memory. The algorithm has a worst-case running time of O(n log k), but the actual running time is highly dependent on the order of items in the list. The worst case is when the array is sorted in ascending order; the best case is when it is sorted in descending order. In the average case, far fewer than 50% of the items are added to and removed from the heap.
Another algorithm, Quickselect, has complexity O(n), but is slower than the heap selection method when k is a small percentage (1 or 2%) of n. Quickselect also modifies the existing list, which might not be something you want.
See my blog post, https://blog.mischel.com/2011/10/25/when-theory-meets-practice/, for more details.
You can do a few things here to speed up your average time by maintaining a heap rather than rebuilding it for every query.
If the number of items you want will always be less than 30, then maintain a heap of 30 items all the time. If the user just wants the top 10, then you can pick just those from the heap.
When an item is added to the list, check to see if it is larger than the smallest item on the heap. If it is, replace the smallest item.
When an item is deleted, mark the heap as dirty.
When asked for the top k items, if the heap is dirty then you have to rebuild it. Otherwise, you can just copy the contents of the heap to a scratch array, sort it, and return the k items that were asked for. Of course, once you rebuild the heap, clear the dirty flag.
The result, then, is that you can maintain the heap at little cost: potentially updating it whenever a new item is added, but only if it is larger than one of the top 30 (or whatever your max is) items. The only time you have to rebuild is when asked for the top k items after a deletion.
Come to think of it, you only have to mark the heap as dirty if the item you delete is greater than or equal to the smallest item on the heap. Also, if the heap is marked as dirty, then you can forego any further update on insertion or deletion because you have to rebuild the heap anyway the next time you get a query.
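A loose C++ sketch of this maintained-heap idea, assuming a fixed upper bound on k (30 here, as in the example) and that the owner of the list calls onInsert/onDelete at the right moments; the class and method names are illustrative only:

#include <algorithm>
#include <functional>
#include <queue>
#include <vector>

class TopKCache {
    using MinHeap = std::priority_queue<int, std::vector<int>, std::greater<int>>;
    const std::vector<int>& items;                  // the big underlying list
    MinHeap heap;                                   // holds the MAX_K largest items
    bool dirty = true;
    static constexpr size_t MAX_K = 30;             // largest k a caller may request

    void add(int x) {
        if (heap.size() < MAX_K) heap.push(x);
        else if (x > heap.top()) { heap.pop(); heap.push(x); }
    }
    void rebuild() {
        heap = MinHeap();
        for (int x : items) add(x);
        dirty = false;
    }
public:
    explicit TopKCache(const std::vector<int>& list) : items(list) {}

    void onInsert(int x) { if (!dirty) add(x); }    // cheap incremental update
    void onDelete(int x) {                          // only matters if x could be in the heap
        if (!dirty && !heap.empty() && x >= heap.top()) dirty = true;
    }
    std::vector<int> topK(size_t k) {               // k <= MAX_K
        if (dirty) rebuild();
        std::vector<int> out;
        MinHeap copy = heap;                        // scratch copy, O(MAX_K)
        while (!copy.empty()) { out.push_back(copy.top()); copy.pop(); }
        std::reverse(out.begin(), out.end());       // largest first
        if (out.size() > k) out.resize(k);
        return out;
    }
};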
A (balanced) binary search tree is your best friend. Insertions, deletions, and search for the k-th element are all O(log N).
If the data resides in external memory, then a B-tree or similar.
If n is much smaller than size(list), use a hash table for the main elements and a companion data structure to store the n biggest elements. The companion data structure is updated during insertion and deletion, and it is used to answer queries for the biggest elements.
If n is 30, a sorted array is sufficient.
Disclaimer: this approach performs poorly if the biggest elements are often removed. Deleting one of the biggest elements would require a sequential scan of the whole hash table.
In the C++ STL, your best bet is to use a std::set.
Every time you add an element, the container stays ordered.
Then you can extract the last n elements of the std::set.
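A tiny illustration of that suggestion (note that a std::multiset would be needed if duplicate values must be kept; the sample values here are arbitrary):

#include <iostream>
#include <iterator>
#include <set>
#include <vector>

int main() {
    std::set<int> s = {42, 7, 19, 3, 88, 56};     // stays sorted on every insert
    int n = 3;                                    // how many largest elements we want
    std::vector<int> largest(s.rbegin(), std::next(s.rbegin(), n));
    for (int x : largest) std::cout << x << ' ';  // prints: 88 56 42
}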

Sorting an array in O(n log(log n))

I am attempting to solve the following assignment:
I'm given an array of n elements. It is known that not all keys of the array are distinct, but it is given that there are k distinct keys (k <= n, of course).
The assignment is to stably sort the array in O(n log(log n)) worst case, given that k = O(log n). I'm allowed to use O(n) extra memory.
My current solution is described below:
Create a hash table with chaining, of size k, that does the following:
If the hash function maps an element to a slot that already has a value in it, it checks whether the keys are equal. If they are, it appends the element to that slot's list; if not, it probes forward through the array until it finds a slot with the same key or an empty slot (whichever comes first).
This way, each slot's list contains only elements with equal keys. Insertion into the hash table proceeds from the start to the end of the original array, so each list is stably ordered.
Then sort the contents of the hash table with merge sort (treating each list as a single item, represented by its first element).
After the merge sort is done, we copy the elements back to the original array in order, and whenever we meet a list we copy its elements in order.
Here is what I'm not sure about:
Is it true to say that because the hash table has size k and there are only k distinct keys, uniform hashing promises that the time the hash function spends resolving collisions between different keys is negligible, and therefore the table can be built in O(n) time?
If so, it seems the algorithm's runtime is O(n + k log k) = O(n + log n · log(log n)),
which is definitely better than the required O(n log k).
I think you're on the right track with the hash table, but I don't think you should insert the elements in the hash table, then copy them out again. You should use the hash table only to count the number of elements for each distinct value.
Next you compute the starting index for each distinct value, by traversing all values in order and adding the previous element's count to its start index:
start index for element i = start index for element i-1 + count for element i-1.
This step requires sorting the k elements in the hash table, which amounts to O(k log k) = O(log n log log n) operations, much less than the O(n) for steps 1 and 3.
Finally, you traverse your input array again, look each element up in the table, and find its location in the output array. You copy the element, and also increase the start index for its value, so that the next element with the same value is placed right after it (a code sketch follows below).
If the comparison values for the items are consecutive integers (or integers in some small range), then you can use an array instead of a hash table for counting.
This is called counting sort.
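Here is a hedged C++ sketch of the counting approach described above, with plain int keys standing in for the real elements (real elements would carry a payload alongside the key; the function name is mine):

#include <algorithm>
#include <unordered_map>
#include <utility>
#include <vector>

void stableSortFewKeys(std::vector<int>& a) {
    std::unordered_map<int, size_t> count;           // step 1: O(n) counting
    for (int x : a) ++count[x];

    std::vector<int> keys;                           // step 2: sort the k distinct keys,
    for (auto& kv : count) keys.push_back(kv.first); //         O(k log k)
    std::sort(keys.begin(), keys.end());

    std::unordered_map<int, size_t> start;           // start index per key (prefix sums)
    size_t idx = 0;
    for (int key : keys) { start[key] = idx; idx += count[key]; }

    std::vector<int> out(a.size());                  // step 3: stable placement, O(n)
    for (int x : a) out[start[x]++] = x;
    a = std::move(out);
}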
on the same note as here:
You create a binary search tree (any balanced tree will do):
each node holds a list of the elements that share one distinct key.
Now we iterate over the array, and for each key we search for it in the tree;
the search takes O(log(log n)), as there are at most O(log n) distinct nodes in the tree. (If the key doesn't exist, we just add it as a node to the tree.)
So iterating over the array takes O(n log(log n)), as we have n elements.
Finally, since this is a binary search tree, we can traverse it in order
and get the keys in sorted order.
All that's left is to combine the lists into a single array,
which takes O(n) time.
So we get O(n + n log(log n)) = O(n log(log n)).
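A short C++ sketch of this tree-of-buckets idea, using std::map (a balanced BST) keyed by the distinct values; as above, plain ints stand in for elements with a payload, and the function name is mine:

#include <map>
#include <vector>

void stableSortViaTree(std::vector<int>& a) {
    std::map<int, std::vector<int>> buckets;         // key -> elements, in arrival order
    for (int x : a) buckets[x].push_back(x);         // n lookups, O(n log(log n)) total
    size_t i = 0;
    for (auto& kv : buckets)                         // in-order walk: ascending keys
        for (int x : kv.second) a[i++] = x;          // stable within each bucket
}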

Partial sorting to find the kth largest/smallest elements

Source: Wikipedia
A streaming, single-pass partial sort is also possible using heaps or other priority queue data structures. First, insert the first k elements of the input into the structure. Then make one pass over the remaining elements, add each to the structure in turn, and remove the largest element. Each insertion operation also takes O(log k) time, resulting in O(n log k) time overall.
How is this different from the approach where we first heapify the complete input array in O(n) time and then extract the minimum of the heap k times?
I don't understand the part where it says to make one pass over the remaining elements, add each to the structure in turn, and remove the largest element. Isn't this the same as the heapify-and-extract method described above?
The suggested method is streaming: it doesn't need to have all the items in memory, which gives it O(k) space complexity (but it only finds the top k items), whereas the heapify approach needs the whole array at once.
A more explicit description of the algorithm (see also the reference WP gives) is:
given a stream of items:
  make a heap of the first k elements in the stream,
  for each element after the first k:
    push it onto the heap,
    extract the largest (or smallest) element from the heap and discard it,
  finally, return the k values left in the heap.
By construction, the heap never grows to more than k + 1 elements. The items can be streamed in from disk, over a network, etc., which is not possible with the heapify algorithm.
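For concreteness, here is a small C++ sketch of the streaming selection (reading ints from a std::istream is an assumption for the example; pushing first and then popping when the heap exceeds k is equivalent to the description above):

#include <istream>
#include <queue>
#include <vector>

// Keep the k smallest values seen so far in a max-heap of at most k+1 items;
// flip the comparator (use a min-heap) to keep the k largest instead.
std::vector<int> smallestK(std::istream& stream, size_t k) {
    std::priority_queue<int> heap;                 // max-heap
    int x;
    while (stream >> x) {
        heap.push(x);                              // push each item...
        if (heap.size() > k) heap.pop();           // ...and discard the largest
    }
    std::vector<int> result;                       // the k smallest, largest first
    while (!heap.empty()) { result.push_back(heap.top()); heap.pop(); }
    return result;
}

Calling it as, say, smallestK(std::cin, 10) never holds more than 11 values in memory, no matter how long the input is.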

Design a data structure with insertion, deletion, and get-random in O(1) time complexity

It was a recent interview question. Please design a data structure with insertion, deletion, and get-random in O(1) time complexity. The data structure can be a basic data structure such as an array, a modification of a basic data structure, or a combination of basic data structures.
Combine an array with a hash-map of element to array index.
Insertion can be done by appending to the array and adding to the hash-map.
Deletion can be done by first looking up and removing the array index in the hash-map, then swapping the last element with that element in the array, updating the previously last element's index appropriately, and decreasing the array size by one (removing the last element).
Get random can be done by returning a random index from the array.
All operations take O(1).
Well, in reality, it's amortised (from resizing the array) expected (from expected hash collisions) O(1), but close enough.
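A minimal C++ sketch of that combination, assuming the stored values are unique (as in the usual formulation of this interview question); the class name is illustrative:

#include <random>
#include <unordered_map>
#include <vector>

class RandomizedSet {
    std::vector<int> items;
    std::unordered_map<int, size_t> index;           // value -> position in items
    std::mt19937 rng{std::random_device{}()};
public:
    void insert(int x) {
        if (index.count(x)) return;
        index[x] = items.size();
        items.push_back(x);
    }
    void remove(int x) {
        auto it = index.find(x);
        if (it == index.end()) return;
        size_t pos = it->second;
        items[pos] = items.back();                   // move the last element into the hole
        index[items[pos]] = pos;                     // fix its recorded position
        items.pop_back();
        index.erase(x);
    }
    int getRandom() {                                // assumes the set is non-empty
        std::uniform_int_distribution<size_t> pick(0, items.size() - 1);
        return items[pick(rng)];
    }
};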
A radix tree would work. See http://en.wikipedia.org/wiki/Radix_tree. Insertion and deletion are O(k) where k is the maximum length of the keys. If all the keys are the same length (e.g., all pointers), then k is a constant so the running time is O(1).
In order to implement get random, maintain a record of the total number of leaves in each subtree (O(k)). The total number of leaves in tree is recorded at the root. To pick one at random, generate a random integer to represent the index of the element to pick. Recursively scan down the tree, always following the branch that contains the element you picked. You always know which branch to choose because you know how many leaves can be reached from each subtree. The height of the tree is no more than k, so this is O(k), or O(1) when k is constant.

Looking for a data container with O(1) indexing and O(log(n)) insertion and deletion

I'm not sure if it's possible but it seems a little bit reasonable to me, I'm looking for a data structure which allows me to do these operations:
insert an item with O(log n)
remove an item with O(log n)
find/edit the k'th-smallest element in O(1), for arbitrary k (O(1) indexing)
Of course, editing won't result in any change in the order of the elements. What makes this seem somewhat possible is that I'm going to insert elements one by one in increasing order. So if, for example, I insert for the fifth time, I'm sure all four elements inserted before it are smaller than it, and all the elements inserted after it are going to be larger.
I don't know if the requested time complexities are possible for such a data container. But here are a couple of approaches which almost achieve them.
The first one is a tiered vector, with O(1) insertion and indexing but O(sqrt N) deletion. Since you expect only about 10000 elements in this container, and sqrt(10000)/log(10000) ≈ 7, you get almost the required performance here. A tiered vector is implemented as an array of ring buffers, so deleting an element requires moving all elements following it in its ring buffer, and moving one element from each of the following ring buffers into the one preceding it; indexing in this container means indexing into the array of ring buffers and then indexing inside the ring buffer.
It is possible to create a different container, very similar to the tiered vector, with exactly the same complexities but working a little faster because it is more cache-friendly. Allocate an N-element array to store the values, and allocate a sqrt(N)-element array to store index corrections (initialized with zeros). I'll show how it works on the example of a 100-element container. To delete the element with index 56, move elements 57..60 to positions 56..59, then in the array of index corrections add 1 to elements 6..9. To find the 84th element, look up the eighth element in the array of index corrections (its value is 1), add that value to the index (84+1=85), then take the 85th element from the main array. After about half of the elements in the main array have been deleted, it is necessary to compact the whole container to restore contiguous storage; this adds only O(1) amortized cost per operation. For real-time applications this compaction may be performed in several smaller steps.
This approach may be extended to a trie of depth M, taking O(M) time for indexing, O(M·N^(1/M)) time for deletion, and O(1) time for insertion. Just allocate an N-element array to store the values, plus arrays of N^((M-1)/M), N^((M-2)/M), ..., N^(1/M) elements to store index corrections. To delete element 2345, move 4 elements in the main array, increase 5 elements in the largest "corrections" array, increase 6 elements in the next one, and 7 elements in the last one. To get element 5678 from this container, add to 5678 the corrections at elements 5, 56, and 567, and use the result to index the main array. Choosing different values for M, you can balance the complexity between indexing and deletion operations. For example, for N = 65000 you can choose M = 4; then indexing requires only 4 memory accesses and deletion updates 4*16 = 64 memory locations.
I wanted to point out first that if k is really a random number, then it might be worth considering that the problem might be completely different: asking for the k-th smallest element, with k uniformly at random in the range of the available elements is basically... picking an element at random. And it can be done much differently.
Here I'm assuming you actually need to select for some specific, if arbitrary, k.
Given your strong pre-condition that your elements are inserted in order, there is a simple solution:
Since your elements are given in order, just add them one by one to an array; that is you have some (infinite) table T, and a cursor c, initially c := 1, when adding an element, do T[c] := x and c := c+1.
When you want to access the k-th smallest element, just look at T[k].
The problem, of course, is that as you delete elements, you create gaps in the table, such that element T[k] might not be the k-th smallest, but the j-th smallest with j <= k, because some cells before k are empty.
It then is enough to keep track of the elements which you have deleted, to know how many deleted elements are smaller than k. How do you do this in time at most O(log n)? By using a range tree or a similar data structure. A range tree is a structure that lets you add integers and then query for all integers between X and Y. So, whenever you delete an item, simply add its position to the range tree; and when you are looking for the k-th smallest element, query for all positions between 0 and k that have been deleted. Say that delta of them have been deleted; then the k-th element would be at T[k+delta].
There are two slight catches, which require some fixing:
The range tree returns the range in time O(log n), but to count the number of elements in the range, you must walk through each element in the range and so this adds a time O(D) where D is the number of deleted items in the range; to get rid of this, you must modify the range tree structure so as to keep track, at each node, of the number of distinct elements in the subtree. Maintaining this count will only cost O(log n) which doesn't impact the overall complexity, and it's a fairly trivial modification to do.
In truth, making just one query will not work. Indeed, if you find delta deleted elements in the range 1 to k, then you need to make sure that no elements were deleted in the range k+1 to k+delta, and so on. The full algorithm is something along the lines of the following.
KthSmallest(T, k) := {
    a = 1; b = k
    do {
        delta = deletedInRange(a, b)
        a = b + 1
        b = b + delta
    } while (delta > 0)
    return T[b]
}
The exact complexity of this operation depends on how exactly you make your deletions, but if your elements are deleted uniformly at random, then the number of iterations should be fairly small.
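As an illustration, here is a C++ sketch of the counting structure, using a Fenwick (binary indexed) tree in place of the augmented range tree; it supports "mark position i as deleted" and "how many deletions in 1..b", both in O(log n). The names are mine, and the lookup loop uses cumulative counts rather than per-range queries, which amounts to the same thing as the pseudocode above.

#include <vector>

struct Fenwick {                                    // 1-based binary indexed tree
    std::vector<int> t;
    explicit Fenwick(int n) : t(n + 1, 0) {}
    void markDeleted(int i) {                       // i in 1..n
        for (; i < (int)t.size(); i += i & -i) ++t[i];
    }
    int deletedUpTo(int b) const {                  // number of deletions in 1..b
        int s = 0;
        for (; b > 0; b -= b & -b) s += t[b];
        return s;
    }
};

// k-th smallest surviving element of table T (1-based; T[0] unused), given the
// deletions recorded in the Fenwick tree. Assumes at least k elements survive.
int kthSmallest(const std::vector<int>& T, const Fenwick& deleted, int k) {
    int b = k, prev = 0;
    for (;;) {
        int d = deleted.deletedUpTo(b);
        if (d == prev) return T[b];                 // no new deletions found below b
        b += d - prev;                              // skip over the newly found gaps
        prev = d;
    }
}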
There is a Treelist (implementation for Java, with source code), which is O(lg n) for all three ops (insert, delete, index).
Actually, the accepted name for this data structure seems to be "order statistic tree". (Apart from indexing, it's also defined to support indexof(element) in O(lg n).)
By the way, O(1) is not considered much of an advantage over O(lg n). Such differences tend to be overwhelmed by the constant factor in practice. (Are you going to have 1e18 items in the data structure? If we set that as an upper bound, that's just equivalent to a constant factor of 60 or so.)
Look into heaps. Insert and removal should be O(log n) and peeking of the smallest element is O(1). Peeking or retrieval of the K'th element, however, will be O(log n) again.
EDITED: as amit stated, retrieval is more expensive than just peeking
This is probably not possible.
However, you can make certain changes in balanced binary trees to get kth element in O(log n).
Read more about it here : Wikipedia.
Indexable skip lists might be able to do (close to) what you want:
http://en.wikipedia.org/wiki/Skip_lists#Indexable_skiplist
However, there's a few caveats:
It's a probabilistic data structure. That means it's not necessarily going to be O(log N) for all operations
It's not going to be O(1) for indexing, just O(log N)
Depending on the speed of your RNG and also depending on how slow traversing pointers are, you'll likely get worse performance from this than just sticking with an array and dealing with the higher cost of removals.
Most likely, something along the lines of this is going to be the "best" you can do to achieve your goals.
