What is the optimal algorithm for selecting the top n elements from multiple arrays, given that each array is sorted in the same order the resulting array should be?
Reading elements is very expensive, so the number of reads should be kept to an absolute minimum.
Put tuples (current_element, array_number, current_index=0) into a priority queue (for example, one based on a binary max-heap), ordered by element value.
Then remove the top of the queue n times.
After each removal, increment the index in the corresponding array (if possible), read the next element there, and insert the updated tuple into the queue again.
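A minimal sketch of this merge in C++, assuming ascending order (so a min-heap plays the role of the priority queue) and int elements; the names are illustrative, not from the answer above:

#include <cstdio>
#include <queue>
#include <tuple>
#include <vector>

// Each queue entry is (value, array number, index within that array).
// An element of arrays[a] is read only when it can actually be the next
// output, so the total number of reads is about (#arrays + n).
std::vector<int> top_n(const std::vector<std::vector<int>>& arrays, std::size_t n) {
    using Entry = std::tuple<int, std::size_t, std::size_t>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pq;

    for (std::size_t a = 0; a < arrays.size(); ++a)
        if (!arrays[a].empty())
            pq.emplace(arrays[a][0], a, 0);          // first element of every array

    std::vector<int> result;
    while (result.size() < n && !pq.empty()) {
        auto [value, a, i] = pq.top();
        pq.pop();
        result.push_back(value);
        if (i + 1 < arrays[a].size())                // advance within the same array
            pq.emplace(arrays[a][i + 1], a, i + 1);
    }
    return result;
}

int main() {
    std::vector<std::vector<int>> arrays = {{1, 4, 9}, {2, 3, 8}, {5, 6, 7}};
    for (int x : top_n(arrays, 4)) std::printf("%d ", x);  // prints: 1 2 3 4
    return 0;
}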
Does there exist a data structure with the following properties:
Elements are stored in some order
Accessing the element at a given index takes O(1) time (possibly amortized)
Removing an element takes amortized O(1) time, and changes the indices appropriately (so if element 0 is removed, the next access to element 0 should return the old element 1)
For context, I reduced an algorithm question from a programming competition to:
Over m queries, return the kth smallest positive number that hasn't been returned yet. You can assume the returned number is less than some constant n.
If the data structure above exists, then you can do this in O(m) time, by creating a list of numbers 1 to n. Then, for each query, find the element at index k and remove it. During the contest itself, my solution ended up being O(m^2) on certain inputs.
I'm pretty sure you can do this in O(m log m) with binary search trees, but I'm wondering if the ideal O(m) is reachable. Stuff I've found online tends to be close, but not quite there - the tricky part is that the elements you remove can be from anywhere in the list.
Well, O(1) removal is possible with a doubly linked list:
each element has a pointer to the next and previous element, so removal just deletes the element and re-links its neighbors, like:
element[ix-1].next = element[ix+1];
element[ix+1].prev = element[ix-1];
Accessing ordered elements by index in O(1) can be done with an indexed array:
you have an unordered array like dat[] and an index array like idx[], and the access of element ix is just:
dat[idx[ix]]
Now the problem is to have both properties at once:
you can try a linked list with an index array, but a removal needs to update the index table, which is O(N) in the worst case.
If you have just the index array, then removal is also O(N).
If you keep the index in some form of tree structure, then removal can be close to O(log(N)), but access will also be about O(log(N)).
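As one concrete illustration of that O(log(N)) trade-off applied to the contest problem above, here is a sketch using a Fenwick (binary indexed) tree over a 0/1 "still present" array; the structure and names are my own choice, not something from the question:

#include <cstdio>
#include <vector>

// find_kth() walks the tree to locate the k-th remaining value (1-based),
// remove() clears its bit. Both are O(log n), giving O(m log n) overall.
struct PresenceTree {
    int n;
    std::vector<int> bit;                       // 1-based Fenwick tree

    explicit PresenceTree(int n) : n(n), bit(n + 1, 0) {
        for (int i = 1; i <= n; ++i) add(i, 1); // every value 1..n starts present
    }

    void add(int i, int delta) {
        for (; i <= n; i += i & -i) bit[i] += delta;
    }

    // Returns the k-th (1-based) value still present.
    int find_kth(int k) const {
        int pos = 0, step = 1;
        while ((step << 1) <= n) step <<= 1;
        for (; step > 0; step >>= 1)
            if (pos + step <= n && bit[pos + step] < k) {
                pos += step;
                k -= bit[pos];
            }
        return pos + 1;
    }

    void remove(int value) { add(value, -1); }
};

int main() {
    PresenceTree t(10);
    int x = t.find_kth(3);  t.remove(x);        // 3
    int y = t.find_kth(3);  t.remove(y);        // 4 (3 is gone, so 1,2,4,...)
    std::printf("%d %d\n", x, y);
    return 0;
}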
I believe there is a structure that can do both of these in O(n) time, where n is the number of elements that have been removed, not the total size. So if the number of removals is small compared to the size of the array, it's close to O(1).
Basically, all the data is stored in an array. There is also a priority queue for deleted elements. Initialise like so:
Data = [0, 1, 2, ..., m]
removed = new list
Then, to remove an element, you add its original index (see below for how to get this) to the priority queue (which is sorted by value, smallest at the front), and leave the array as is. So removing the 3rd element:
Data = [0, 1, 2, 3,..., m]
removed = 2
Then what's now the 4th and was the 5th:
Data = [0, 1, 2, 3,..., m]
removed = 2 -> 4
Then what's now the 3rd and was the 4th:
Data = [0, 1, 2, 3,..., m]
removed = 2 -> 3 -> 4
Now, to access an element, you start with its index. You then iterate along the removed list, increasing the index by one each time, until you reach an element that is larger than the increased value of the index. This gives you the original index (i.e. position in Data) of the element you're looking for, and is also the index you need for removal.
This operation of iterating along the queue effectively increases the index by the number of removed elements that lie before it.
Sorry if I haven't explained very well, it was clear in my head but hard to write down.
Comments:
Access is O(n), with n the number of removed items
Removal is approximately twice the time of access, but still O(n)
A disadvantage is that memory use doesn't shrink with removal.
Could potentially 're-initialise' when removed list is large to reset memory use and access and removal times. This operation takes O(N), with N total array size.
So it's not quite what OP was looking for but in the right situation could be close.
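A minimal sketch of this scheme in C++ (assuming ints and 0-based logical indices; the removed list is kept as a sorted vector, and all names are illustrative):

#include <algorithm>
#include <cstdio>
#include <vector>

// Data stays untouched; "removed" holds the original indices of deleted
// elements in ascending order.
struct LazyRemovalList {
    std::vector<int> data;
    std::vector<int> removed;              // sorted ascending

    explicit LazyRemovalList(int m) {
        for (int i = 0; i <= m; ++i) data.push_back(i);
    }

    // Translate a logical index into an original index by walking the removed list.
    int original_index(int ix) const {
        for (int r : removed) {
            if (r <= ix) ++ix;             // one removed slot lies before us: shift right
            else break;
        }
        return ix;
    }

    int access(int ix) const { return data[original_index(ix)]; }

    void remove(int ix) {
        int orig = original_index(ix);
        removed.insert(std::lower_bound(removed.begin(), removed.end(), orig), orig);
    }
};

int main() {
    LazyRemovalList list(9);                 // data = 0..9
    list.remove(2);                          // drop the 3rd element
    list.remove(3);                          // then what is now the 4th (was the 5th)
    list.remove(2);                          // then what is now the 3rd (was the 4th)
    std::printf("%d\n", list.access(2));     // prints 5
    return 0;
}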
The traditional heapsort algorithm swaps the last element of the heap with the root after every heapification, then repeats the process. However, I noticed that some of that work seems unnecessary.
After a heapification of a sub-array, while the root contains the highest value (in a max-heap), the next 2 elements in the array must follow the root in the sorted output, either in the order they are in now, or exchanged if they are reverse-sorted. So instead of just swapping the root with the last element, wouldn't it be better to swap the first 3 elements (after exchanging the 2nd and 3rd if necessary) with the last 3 elements, so that the 2 subsequent heapifications (for the 2nd and 3rd elements) are dispensed with?
Is there any disadvantage to this method (apart from the possible exchange of the 2nd and 3rd elements, which should be trivial)? If not, and it is indeed better, how much of a performance boost will it give? Here is the pseudo-code:
function heapify(a, i, n) {
    // assumes both children of node i are roots of max-heaps, but node i itself
    // may be smaller than one of its children
    var l = 2 * i + 1, r = 2 * i + 2, largest = i;
    if (l < n && a[l] > a[largest]) largest = l;
    if (r < n && a[r] > a[largest]) largest = r;
    if (largest != i) {
        // swap the parent with its greater child; that child may no longer be a heap, so recurse
        var tmp = a[i]; a[i] = a[largest]; a[largest] = tmp;
        heapify(a, largest, n);
    }
}

function create_heap(a) {
    // all elements past index floor(n/2) - 1 are leaves, so heapify is applied
    // to every element from floor(n/2) - 1 down to 0
    for (var i = Math.floor(a.length / 2) - 1; i >= 0; i--) heapify(a, i, a.length);
}

function heapsort(a) {
    create_heap(a); // a is now a max-heap
    for (var n = a.length - 1; n > 0; n--) {
        // the root a[0] is the maximum, so swap it with a[n]; the heap length is now n,
        // since a[n] holds the largest element already found
        var tmp = a[0]; a[0] = a[n]; a[n] = tmp;
        // the new root may be smaller than its children, both of which are heaps,
        // so restore the heap property and continue
        heapify(a, 0, n);
    }
}
Suppose the array is [4,1,3,2,16,9,10,14,8,7]. After building the heap, it becomes [16,14,10,8,7,9,3,2,4,1]. Now heapsort's first iteration will swap 16 and 1, leading to [1,14,10,8,7,9,3,2,4,16]. Since this has now rendered the root of the new heap [1,14,10,8,7,9,3,2,4] as, umm, un-heaped (14 and 10 both being greater than 1), running another heapify produces [14,8,10,4,7,9,3,2,1]. Now 14 being the root, swap it with 1 to yield [1,8,10,4,7,9,3,2,14], making the array currently [1,8,10,4,7,9,3,2,14,16]. Again we find that 1 is un-heaped, so another heapify makes the heap [10,8,9,4,7,1,3,2]. Then 10 is swapped with 2, making the array [2,8,9,4,7,1,3,10,14,16].

My point is that instead of doing the 2nd and 3rd heapifications just to place 14 and 10 before 16, we can tell from the first heapification that because 14 and 10 follow 16, they are the 2nd and 3rd largest elements (or vice-versa). So after a comparison between them (in case they are already in order, 14 comes before 10 here), I swap all three (16,14,10) with the last three (2,4,1), making the array [1,4,2,8,7,9,3,10,14,16]. This brings us to a similar condition as the one after the two further heapifications: [2,8,9,4,7,1,3,10,14,16] originally, as compared to [1,4,2,8,7,9,3,10,14,16] now. Both will need further heapification, but the second method has arrived at this point directly, after just one comparison between two elements (14 and 10).
The second largest element of the heap is present in the second or third position, but the third largest can be present further down, at depth 2. (See the figure in http://en.wikipedia.org/wiki/Heap_(data_structure) ). Furthermore, after swapping the first three elements with the last three, the heapify method would first heapify the first subtree of the root, followed by the second subtree of the root, followed by the whole tree. Thus the total cost of this operation is close to three times the cost of swapping the top element with the last and calling heapify. So you won't gain anything by doing this.
I want to remember the last n unique numbers, in order.
Here is what I mean: Let's say n = 4.
My current list is 5 3 4 2. If I add 6, it turns into 3 4 2 6. If I add 3 instead, the list turns into 5 4 2 3, where 3 moves to the front.
I would do it like this: Store the numbers in a queue. When adding a new number, search through the queue for the number. If the number is not found, pop the number at the end, and push the new number in the front. If the number is found, remove the number at that position, then push the new number in front.
Now obviously, removing a number from an arbitrary position in a container optimized for queue operations (like std::deque in C++) will be quite slow. A linked list, on the other hand, will be slow to search through. Is there a better combination of algorithm and data structure to accomplish this sort of task?
If it makes any difference, I don't necessarily care about "remembering the last n unique numbers, in order." I specifically need to know, what element has been removed from the list upon an addition (if any).
You could use a doubly linked list. You can add your n numbers to be remembered to a hash table where the key is the number itself and the value is a pointer to the node of the linked list that contains that number.
Then, in the step where you describe "search through the queue for the number", you instead check whether the number is in the hash table, which takes constant time instead of the linear time of scanning the queue.
The pop and push operations you describe can be performed in constant time if you store a pointer p to the first element of the doubly linked list and a pointer q to the last element of the list.
Your step "if the number is found, remove the number at that position" can also be performed in constant time, since you already have the position of the number to be removed (by position I mean the pointer you get from the hash table).
UPDATE:
Be careful that you must update your hash table to remove and add new numbers accordingly.
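A minimal sketch of this in C++, assuming int values and a fixed capacity n; add() returns the number that fell off the end, if any (names are illustrative):

#include <cstddef>
#include <cstdio>
#include <list>
#include <optional>
#include <unordered_map>

// Doubly linked list keeps the order (front = most recent); the hash table
// maps each number to its node so lookup and removal are O(1).
class RecentUnique {
    std::size_t capacity;
    std::list<int> order;
    std::unordered_map<int, std::list<int>::iterator> pos;

public:
    explicit RecentUnique(std::size_t n) : capacity(n) {}

    std::optional<int> add(int x) {
        auto it = pos.find(x);
        if (it != pos.end()) {               // already present: move to the front
            order.erase(it->second);
            order.push_front(x);
            it->second = order.begin();
            return std::nullopt;
        }
        std::optional<int> evicted;
        if (order.size() == capacity) {      // full: drop the oldest number
            evicted = order.back();
            pos.erase(order.back());
            order.pop_back();
        }
        order.push_front(x);
        pos[x] = order.begin();
        return evicted;
    }
};

int main() {
    RecentUnique r(4);
    r.add(5); r.add(3); r.add(4); r.add(2);  // remembered: 2 4 3 5 (newest first)
    auto out = r.add(6);                     // evicts 5; remembered: 6 2 4 3
    if (out) std::printf("evicted %d\n", *out);
    return 0;
}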
I have a situation where I want to delete a random node from a heap; what choices do I have? I know we can easily delete the last node and the first node of the heap, but unlike those cases, I am not sure the procedure is well defined for deleting a random node from the heap.
e.g.
_______________________
|X|12|13|14|18|20|21|22|
------------------------
So in this case I can delete the nodes 12 and 22; this is well defined. But can I, for example, delete a random node, say 13, and still somehow maintain the complete-tree property of the heap (along with its other properties)?
I'm assuming that you're describing a binary heap maintained in an array, with the invariant that A[N] <= A[N*2] and A[N] <= A[N*2 + 1] (a min-heap).
If yes, then the approach to deletion is straightforward: replace the deleted element with the last element, and perform a sift-down to ensure that it ends up in the proper place. And, of course, decrement the variable that holds the total number of entries in the heap.
Incidentally, if you're working through heap examples, I find it better to use examples that do not have a total ordering. There's nothing in the definition of a heap that requires (eg) A[3] <= A[5], and it's easy to get misled if your examples have such an ordering.
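A minimal sketch of this deletion for an array-backed min-heap in C++ (0-based here; names are illustrative). One detail beyond the description above: depending on the values, the replacement can also end up smaller than its new parent, so the sketch sifts up as well as down:

#include <cstdio>
#include <utility>
#include <vector>

// Move the element at i down while it is larger than a child.
void sift_down(std::vector<int>& h, std::size_t i) {
    for (;;) {
        std::size_t l = 2 * i + 1, r = 2 * i + 2, smallest = i;
        if (l < h.size() && h[l] < h[smallest]) smallest = l;
        if (r < h.size() && h[r] < h[smallest]) smallest = r;
        if (smallest == i) return;
        std::swap(h[i], h[smallest]);
        i = smallest;
    }
}

// Move the element at i up while it is smaller than its parent.
void sift_up(std::vector<int>& h, std::size_t i) {
    while (i > 0 && h[i] < h[(i - 1) / 2]) {
        std::swap(h[i], h[(i - 1) / 2]);
        i = (i - 1) / 2;
    }
}

void erase_at(std::vector<int>& h, std::size_t i) {
    h[i] = h.back();          // replace the deleted element with the last one
    h.pop_back();             // decrement the entry count
    if (i < h.size()) {       // restore the heap property around position i
        sift_down(h, i);
        sift_up(h, i);
    }
}

int main() {
    std::vector<int> h = {12, 13, 14, 18, 20, 21, 22};  // already a valid min-heap
    erase_at(h, 1);                                     // delete the value 13
    for (int x : h) std::printf("%d ", x);              // prints: 12 18 14 22 20 21
    return 0;
}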
I don't think it is possible to remove a random element from a heap. Let's take this example (following the same convention):
3, 10, 4, 15, 20, 6, 5.
Now if I delete element 15, the heap becomes: 3, 10, 4, 5, 20, 6
This makes the heap inconsistent, because 5 is now a child of 10.
The reason I think random deletion won't work is that you may substitute an inside node (instead of the root or a leaf) in the heap, and thus there are two paths (toward the parents and toward the children) to heapify, as compared to one path in the case of pop() or insert().
Please let me know in case I am missing something here.