I am attempting to solve the following assignment:
I'm given an array of n elements. It is known that not all keys of the array are distinct, but we do know there are k distinct keys (k <= n, of course).
The assignment is to stably sort the array in O(n log(log n)) worst-case time when k = O(log n). I'm allowed to use O(n) extra memory.
My current solution is described below:
Create a hash table with chaining of size k that does the following:
if the hash function tries to insert an element into a slot that already holds a value, it checks whether the two keys are equal. If they are, it appends the element to that slot's list; if not, it probes forward through the table until it finds either a slot holding the same key or an empty slot (whichever comes first).
This way each slot's list contains only elements with equal keys. The insertion into the hash table scans the original array from start to finish, so each list is in stable (original) order.
Then sort the contents of the hash table with merge sort, treating each list as a single item represented by its key and moving it as a whole.
After the merge sort is done, we copy the elements back to the original array in order; whenever we reach a list we copy its elements in order.
Here is what I'm not sure about:
Is it true to say that, because the hash table has size k and we only have k distinct keys, uniform hashing promises that the time the hash function spends resolving collisions between different keys is negligible, and therefore its build time complexity is O(n)?
Because if so, it seems the algorithm's runtime is O(n + k log k) = O(n + log n * log(log n)),
which is definitely better than O(n log k), which is what was required.
I think you're on the right track with the hash table, but I don't think you should insert the elements in the hash table, then copy them out again. You should use the hash table only to count the number of elements for each distinct value.
Next you compute the starting index for each distinct value, by traversing the values in order and adding the previous value's count to its start index:
start index for value i = start index for value i-1 + count for value i-1.
This step requires sorting the k distinct values in the hash table, which amounts to O(k log k) = O(log n log log n) operations, much less than the O(n) for steps 1 and 3.
Finally, you traverse your input array again, look each element up in the table, and find its location in the output array. You copy the element there, and also increment the start index stored for its value, so that the next element with the same value is copied right after it.
If the comparison values for the items are consecutive integers (or integers in some small range), then you can use an array instead of a hash table for counting.
This is called counting sort.
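A minimal sketch of that counting approach in Java, assuming plain int keys (for records you would carry the payload along in step 3); the helper name stableSortByCounting is made up for illustration:

import java.util.*;

// Sketch: count per distinct key, compute start indices, then place elements
// in a second pass (stable because the input is scanned left to right).
static int[] stableSortByCounting(int[] a) {
    // Step 1: count occurrences of each distinct key, O(n).
    Map<Integer, Integer> count = new HashMap<>();
    for (int x : a) count.merge(x, 1, Integer::sum);

    // Step 2: sort the k distinct keys and turn counts into start indices,
    // O(k log k) = O(log n * log log n) when k = O(log n).
    List<Integer> keys = new ArrayList<>(count.keySet());
    Collections.sort(keys);
    Map<Integer, Integer> start = new HashMap<>();
    int next = 0;
    for (int key : keys) {
        start.put(key, next);
        next += count.get(key);
    }

    // Step 3: copy each element to its key's next free slot, O(n).
    int[] out = new int[a.length];
    for (int x : a) {
        int pos = start.get(x);
        out[pos] = x;
        start.put(x, pos + 1);
    }
    return out;
}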
On the same note as here:
You create a binary search tree (kept balanced, so that searching it is logarithmic in the number of nodes):
each node holds a distinct key and a list of the elements that have that key.
Now we iterate over the array, and for each element we search for its key in the tree;
the search takes O(log(log n)), since the tree holds at most log n distinct keys. (If the key doesn't exist yet, we just add it as a new node.)
So iterating over the array takes O(n log(log n)), as we have n elements.
Finally, since this is a binary search tree, we can do an in-order traversal
and get the keys in sorted order;
all that is left is to concatenate the lists into a single array,
which takes O(n) time.
So we get O(n + n log(log n)) = O(n log(log n)).
My approach was that the log n factor means we build a tree by repeatedly splitting the array into two halves until every element is on its own, and then we check whether each element is in the array or not.
In this approach the time complexity should be O(n log n), because the element we are looking for could be in the last leaf.
What am I missing?
I found a solution but I didn't understand it.
The solution you show is not for the problem you are trying to solve. In their context they can test if some group of cities is contaminated by what they call an "experiment" and want to know how few experiments they need to find exactly which k cities are contaminated. This is a theoretical situation and is most likely meant only as an exercise with no real life application.
In reality, you can't test if an array contains an element in less than O(n) time in the worst case, unless this array has some structure to it (for example being sorted). In particular, you can't test if an array contains k elements in O(k log n) time for k = o(n / log(n)), because in this case k log n = o(n). This of course also means that it is impossible for constant k.
(But if k is of the same order as n then O(k log n) becomes O(n log n) and then you can resort to sorting your array).
If your array is sorted then you can perform k binary searches and achieve this runtime.
Binary search uses that approach:
it takes an array sorted in ascending order as input,
goes to the middle element,
and if that's the element you are looking for, you're done;
otherwise:
if the middle element < the element searched for, do the same thing using the right half of the array as input;
if the middle element > the element searched for, do the same thing using the left half of the array as input.
Continue until either the element has been found or the array size is 1.
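For reference, the same steps as an iterative Java sketch (returns an index of the target, or -1 if it is absent):

// Iterative binary search over an array sorted in ascending order.
static int binarySearch(int[] sorted, int target) {
    int lo = 0, hi = sorted.length - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;            // middle element
        if (sorted[mid] == target) return mid;   // found it
        if (sorted[mid] < target) lo = mid + 1;  // search the right half
        else hi = mid - 1;                       // search the left half
    }
    return -1;                                   // not present
}

Running this once for each of the k keys gives the O(k log n) bound mentioned above.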
If the array is not sorted, there is no way you can do this in k log n time; you need at least O(n) just to check every element against the desired ones.
I am solving a problem but I got stuck on this part.
There are 3 types of query: add an element (an integer), remove an element, and get the sum of the n largest elements (n can be any integer). How can I do this efficiently? I currently use this solution: add an element, remove an element (binary search, O(lg n)); getSum (naive, O(n)).
A segment tree is commonly used to find the sum of a given range. Building that on top of a binary search tree should get you the data structure you are looking for, with O(log N) add, remove, and range-sum operations. By querying the sum over the range where the k largest elements sit (roughly positions N-k to N), you can get the sum of the k largest elements in O(log N). The result is a mutable, ordered segment tree rather than the standard immutable (static), unordered one.
Basically, you just add to each node a count of the elements in its subtree and the sum of their values, and use that information to find the sum via O(log N) additions and/or subtractions.
If k is fixed, you can use the same approach that gives O(1) find-min/max in heaps to get the sum of the k largest elements in O(1): simply keep a variable holding that sum up to date during each O(log N) add/remove.
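Not the segment-tree-over-a-BST construction itself, but as one concrete illustration of the same counts-and-sums idea, here is a sketch using two Fenwick (binary indexed) trees indexed by value; it assumes the stored integers fall in a known range 1..MAX_V, and then add, remove and sum-of-k-largest are all O(log MAX_V):

// Sketch: counts and sums over a bounded value range 1..MAX_V (assumption).
class KLargestSum {
    static final int MAX_V = 1_000_000;        // assumed bound on element values
    final long[] cnt = new long[MAX_V + 1];    // Fenwick tree of counts
    final long[] sum = new long[MAX_V + 1];    // Fenwick tree of value sums
    long total = 0, totalSum = 0;

    void add(int v)    { update(v, +1, +v); }
    void remove(int v) { update(v, -1, -v); } // caller must ensure v is present

    private void update(int v, long dc, long ds) {
        total += dc; totalSum += ds;
        for (int i = v; i <= MAX_V; i += i & -i) { cnt[i] += dc; sum[i] += ds; }
    }

    private long prefixCnt(int v) { long c = 0; for (int i = v; i > 0; i -= i & -i) c += cnt[i]; return c; }
    private long prefixSum(int v) { long s = 0; for (int i = v; i > 0; i -= i & -i) s += sum[i]; return s; }

    // Sum of the k largest stored elements, for 1 <= k <= total.
    long sumOfKLargest(long k) {
        long rank = total - k + 1;             // the k-th largest, counted from the bottom
        int v = 0; long remaining = rank;
        for (int step = Integer.highestOneBit(MAX_V); step > 0; step >>= 1) {
            if (v + step <= MAX_V && cnt[v + step] < remaining) {
                v += step;
                remaining -= cnt[v];
            }
        }
        v += 1;                                 // smallest value whose cumulative count reaches 'rank'
        long cntAbove = total - prefixCnt(v);   // elements strictly larger than v
        long sumAbove = totalSum - prefixSum(v);
        return sumAbove + (k - cntAbove) * (long) v; // fill the rest with copies of v
    }
}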
A lot depends on the relative frequency of the queries, but if we assume a typical situation where the sum query is much more frequent than the add/remove requests (and add is more frequent than remove), a solution is to store, for each element, a tuple of the number and the running sum up to it.
So the first element will be (a1, a1), the second element in your list will be (a2, a1+a2), and so on. (Note that when you append a new element at the end, you still don't need to redo the whole sum; just add the new number to the preceding element's sum. Inserting at the k-th position, however, means every later running sum must also be increased by the new number.)
Removals will be quite expensive too, but that's the trade-off for an O(1) sum query.
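A sketch of that layout, assuming vals is kept sorted ascending and prefix[i] holds vals[0] + ... + vals[i]:

// Sum of the k largest elements from a sorted array plus its running sums: O(1).
static long sumOfKLargest(long[] vals, long[] prefix, int k) {
    int n = vals.length;
    long total = prefix[n - 1];                        // sum of everything
    return k >= n ? total : total - prefix[n - k - 1]; // drop the n-k smallest
}

Appending a new maximum only needs one addition to extend prefix, but inserting or removing in the middle forces every later running sum to be updated, which is the expensive part noted above.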
I've been asked to devise a data structure called clever-list, which holds items with real-number keys and supports the following operations:
Insert(x) - inserts a new element into the list. Should be in O(log n).
Remove min/max - removes and returns the min/max element in the list. Should be in O(log n) time.
Transform - changes the return object of remove min/max (if was min then to max, and the opposite). Should be in O(1).
Random sample(k) - returns randomly selected k elements from the list(k bigger than 0 and smaller than n). Should be in O(min(k log k, n + (n-k) log (n-k))).
Assumptions about the structure:
The data structure won't hold more than 3n elements at any stage.
We cannot assume that n=O(1).
We can use a Random() method which returns a real number in [0,1) and performs in O(1) time.
I managed to implement the first three methods using a min-max fine heap. However, I don't have a clue how to implement random sample(k) within this time bound. All I could find is "reservoir sampling", which operates in O(n) time.
Any suggestions?
You can do all of that with a min-max heap implemented in an array, including the random sampling.
For the random sampling, pick a random index from 0 to n-1. That's the index of the item you want to remove. Copy that item, then replace the item at that index with the last item in the array and reduce the count. Now, either bubble that item up or sift it down as required.
If it's on a min level and the item is smaller than its parent, then bubble it up. If it's larger than its smallest child, sift it down. If it's on a max level, you reverse the logic.
That random sampling is O(k log n). That is, you'll remove k items from a heap of n items. It's the same complexity as k calls to delete-min.
Additional info
If you don't have to remove the items from the list, then you can do a naive random sampling in O(k) by selecting k indexes from the array. However, there is a chance of duplicates. To avoid duplicates, you can do this:
When you select an item at random, swap it with the last item in the array and reduce the count by 1. When you've selected all the items, they're in the last k positions of the array. This is clearly an O(k) operation. You can copy those items to be returned by the function. Then, set count back to the original value and call your MakeHeap function, which can build a heap from an arbitrary array in O(n). So your operation is O(k + n).
The MakeHeap function is pretty simple:
for (int i = count/2; i >= 0; --i)
{
    SiftDown(i);
}
Another option would be, when you do a swap, to save the swap operation on a stack. That is, save the from and to indexes. To put the items back, just run the swaps in reverse order (i.e. pop from the stack, swap the items, and continue until the stack is empty). That's O(k) for the selection, O(k) for putting it back, and O(k) extra space for the stack.
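A sketch of that swap-and-record approach on the heap's backing array (heap, count and rng are assumed to exist; the heap ends up exactly as it was because the swaps are undone in reverse):

import java.util.*;

// Pick k random items from the first 'count' slots of the heap array,
// then undo the swaps so the heap is left untouched.
static int[] sampleK(int[] heap, int count, int k, Random rng) {
    int[] result = new int[k];
    Deque<int[]> swaps = new ArrayDeque<>();   // stack of (from, to) index pairs
    int end = count;
    for (int i = 0; i < k; i++) {
        int j = rng.nextInt(end);              // random index in the still-live prefix
        end--;
        result[i] = heap[j];
        swap(heap, j, end);                    // park the chosen item in the dead zone
        swaps.push(new int[] { j, end });
    }
    while (!swaps.isEmpty()) {                 // undo in reverse order: O(k)
        int[] s = swaps.pop();
        swap(heap, s[0], s[1]);
    }
    return result;
}

static void swap(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }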
Another way to do it, of course, is to do the removals as I suggested, and once all the removals are done you re-insert the items into the heap. That's O(k log n) to remove and O(k log n) to add.
You could, by the way, do the random sampling in O(k) best case by using a hash table to hold the randomly selected indexes. You just generate random indexes and add them to the hash table (which won't accept duplicates) until the hash table contains k items. The problem with that approach is that, at least in theory, the algorithm could fail to terminate.
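A sketch of that hash-table variant (expected O(k) when k is well below n, but with no worst-case guarantee on the number of retries):

import java.util.*;

// Draw k distinct indexes in [0, n); duplicates are simply rejected by the set.
static Set<Integer> randomIndexes(int n, int k, Random rng) {
    Set<Integer> picked = new HashSet<>();
    while (picked.size() < k) {
        picked.add(rng.nextInt(n));
    }
    return picked;
}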
If you store the numbers in an array, and use a self-balancing binary tree to maintain a sorted index of them, then you can do all the operations with the time complexities given. In the nodes of the tree, you'll need pointers into the number array, and in the array you'll need a pointer back into the node of the tree where that number belongs.
Insert(x) adds x to the end of the array, and then inserts it into the binary tree.
Remove min/max follows the left/right branches of the binary tree to find the min or max, then removes it. You need to swap the last number in the array into the hole produced by the removal. This is when you need the back pointers from the array back into the tree.
Transform toggles a bit that tells the remove min/max operation which one to remove.
Random sample either picks k or (n-k) unique ints in the range 0..n-1 (depending on whether 2k < n). The random sample is either the elements at those k locations in the number array, or the elements at all but those (n-k) locations.
Creating a set of k unique ints in the range 0..n-1 can be done in O(k) time, assuming that (uninitialized) memory can be allocated in O(1) time.
First, assume that you have a way of knowing if memory is uninitialized or not. Then, you could have an uninitialized array of size n, and do the usual k-steps of a Fisher-Yates shuffle, except every time you access an element of the array (say, index i), if it's uninitialized, then you can initialize it to value i. This avoids initializing the entire array which allows the shuffle to be done in O(k) time rather than O(n) time.
Second, obviously it's not possible in general to know if memory is uninitialized or not, but there's a trick you can use (at the cost of doubling the amount of memory used) that lets you implement a sparse array in uninitialized memory. It's described in depth on Russ Cox's blog here: http://research.swtch.com/sparse
This gives you an O(k) way of randomly selecting k numbers. If k is large (i.e. > n/2), you can select (n-k) numbers instead of k, but you still need to return the non-selected numbers to the user, which is always going to be O(k) if you copy them out, so the faster selection gains you nothing.
A simpler approach, if you don't mind giving out access to your internal data structure, is to do k or n-k steps of the Fisher-Yates shuffle on the underlying array (depending on whether k < n/2, and being careful to update the corresponding tree nodes so they keep pointing at the right positions), and then return either a[0..k-1] or a[k..n-1]. In this case, the returned values are only valid until the next operation on the data structure. This method is O(min(k, n-k)).
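A sketch of those k Fisher-Yates steps on the backing array (the corresponding tree-node pointer updates are omitted; rng is assumed):

import java.util.Random;

// After k steps of Fisher-Yates, a[0..k-1] is a uniform random sample of a.
// (In the clever-list, each swap must also update the tree nodes that point
// at the two swapped positions; that bookkeeping is omitted here.)
static void partialShuffle(double[] a, int k, Random rng) {
    int n = a.length;
    for (int i = 0; i < k; i++) {
        int j = i + rng.nextInt(n - i);   // uniform over the not-yet-fixed suffix
        double t = a[i]; a[i] = a[j]; a[j] = t;
    }
}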
It was a recent interview question. Please design a data structure with insertion, deletion, and get-random in O(1) time complexity. The data structure can be a basic one such as an array, a modification of a basic data structure, or a combination of basic data structures.
Combine an array with a hash-map of element to array index.
Insertion can be done by appending to the array and adding to the hash-map.
Deletion can be done by first looking up and removing the array index in the hash-map, then swapping the last element with that element in the array, updating the previously last element's index appropriately, and decreasing the array size by one (removing the last element).
Get random can be done by returning a random index from the array.
All operations take O(1).
Well, in reality it's amortised (because of array resizing) and expected (because of hash collisions) O(1), but close enough.
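A sketch of that combination (the class and method names are just for illustration):

import java.util.*;

// Array + value->index map: O(1) insert, delete, getRandom (expected/amortised).
class RandomizedSet {
    private final List<Integer> values = new ArrayList<>();
    private final Map<Integer, Integer> indexOf = new HashMap<>();
    private final Random rng = new Random();

    boolean insert(int x) {
        if (indexOf.containsKey(x)) return false;
        indexOf.put(x, values.size());
        values.add(x);
        return true;
    }

    boolean remove(int x) {
        Integer i = indexOf.remove(x);
        if (i == null) return false;
        int last = values.remove(values.size() - 1); // pop the tail
        if (i < values.size()) {                     // if x wasn't the tail,
            values.set(i, last);                     // move the tail into its slot
            indexOf.put(last, i);
        }
        return true;
    }

    int getRandom() {
        return values.get(rng.nextInt(values.size()));
    }
}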
A radix tree would work. See http://en.wikipedia.org/wiki/Radix_tree. Insertion and deletion are O(k) where k is the maximum length of the keys. If all the keys are the same length (e.g., all pointers), then k is a constant so the running time is O(1).
In order to implement get random, maintain a record of the total number of leaves in each subtree (O(k)). The total number of leaves in tree is recorded at the root. To pick one at random, generate a random integer to represent the index of the element to pick. Recursively scan down the tree, always following the branch that contains the element you picked. You always know which branch to choose because you know how many leaves can be reached from each subtree. The height of the tree is no more than k, so this is O(k), or O(1) when k is constant.
The problem at hand is what's in the title itself: give an algorithm which sorts an n-element array with O(log n) distinct elements in O(n log log n) worst-case time. Any ideas?
Further, how do you generally handle arrays with repeated (non-distinct) elements?
O(log(log(n))) time is enough for you to do a primitive operation in a search tree with O(log(n)) elements.
Thus, maintain a balanced search tree of all the distinct elements you have seen so far. Each node in the tree additionally contains a list of all elements you have seen with that key.
Walk through the input elements one by one. For each element, try to insert it into the tree (which takes O(log log n) time). If you find you've already seen an equal element, just insert it into the auxiliary list in the already-existing node.
After traversing the entire list, walk through the tree in order, concatenating the auxiliary lists. (If you take care to insert in the auxiliary lists at the right ends, this is even a stable sort).
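A sketch of this using Java's TreeMap in place of a hand-rolled balanced tree (the map never holds more than k = O(log n) keys, so each lookup or insert into it is O(log k) = O(log log n); records are (key, payload) pairs so the stability is visible):

import java.util.*;

// Stable sort of records with few distinct keys via a balanced tree of lists.
static int[][] stableSortFewDistinct(int[][] records) {
    // records[i] = { key, payload }; TreeMap stands in for the balanced tree.
    TreeMap<Integer, List<int[]>> buckets = new TreeMap<>();
    for (int[] r : records) {
        buckets.computeIfAbsent(r[0], key -> new ArrayList<>()).add(r); // append at the end
    }
    int[][] out = new int[records.length][];
    int pos = 0;
    for (List<int[]> bucket : buckets.values()) { // in-order traversal: ascending keys
        for (int[] r : bucket) out[pos++] = r;    // equal keys keep their input order
    }
    return out;
}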
A simple O(log N) space solution would be:
find the distinct elements using a balanced tree (O(log n) space, roughly O(n log log n) time).
Then you can use this tree to always pick the correct pivot for quicksort.
I wonder if there is an O(log(log N)) space solution.
Some details about using a tree:
You should be able to use a red-black tree (or another balanced search tree) whose nodes hold both a value and a counter: say, a tuple (value, count).
When you insert a new value, you either create a new node or, if a node with that value already exists, increment its count. Either way you first have to find the right position, which takes O(log m) where m is the number of nodes currently in the tree; creating and rebalancing a new node has bigger constants, but is still O(log m).
This ensures the tree has no more than O(log n) nodes (because there are only log n distinct values). So each insertion takes O(log log n), and with n insertions the total is O(n log log n).